[SPARK-27685] `union` doesn't promote non-nullable columns of struct to nullable - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.0
Fix Version/s: None
Component/s: SQL
Labels:
- correctness

Description

When taking the union of two DataFrames, a column that is nullable in either input should be nullable in the result, so the non-nullable side is promoted to nullable.

This promotion does not happen for columns nested inside a struct. The union appears to take the nested nullability from the first DataFrame only, so a nullable nested column can become non-nullable, which results in invalid values.
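
To make the expected behaviour concrete, here is a minimal sketch of the rule described above: nullability is OR-ed at every nesting level, not just at the top. The types `Ty`, `LongTy`, `StructTy`, `Field` and the object `NullabilityMerge` are hypothetical stand-ins invented for illustration; they are not Spark's `StructType`/`StructField`, and this is not how Spark implements `union`.

```scala
// Toy schema model (hypothetical; NOT Spark's StructType/StructField).
sealed trait Ty
case object LongTy extends Ty
final case class StructTy(fields: List[Field]) extends Ty
final case class Field(name: String, ty: Ty, nullable: Boolean)

object NullabilityMerge {
  // Merge two types of the same shape, OR-ing nullability at every level.
  def merge(a: Ty, b: Ty): Ty = (a, b) match {
    case (StructTy(fa), StructTy(fb)) if fa.length == fb.length =>
      StructTy(fa.zip(fb).map { case (l, r) =>
        // Columns are matched by position, so the left-hand name is kept.
        Field(l.name, merge(l.ty, r.ty), l.nullable || r.nullable)
      })
    case _ if a == b => a
    case _ => throw new IllegalArgumentException(s"incompatible types: $a vs $b")
  }
}

object NullabilityMergeExample {
  // Mirrors the two struct schemas in the reproduction: same shape, differing only in x.
  def nestedSchema(xNullable: Boolean): Ty =
    StructTy(List(Field("nested", StructTy(List(Field("x", LongTy, xNullable))), nullable = false)))

  val left: Ty = nestedSchema(xNullable = false)
  val right: Ty = nestedSchema(xNullable = true)
}
```

Under this rule, `NullabilityMerge.merge(left, right)` and `NullabilityMerge.merge(right, left)` both yield a schema in which `nested.x` is nullable, regardless of argument order. The reproduction below shows that the actual struct case depends on the order.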

case class X(x: Option[Long])
case class Nested(nested: X)

// First, just work with normal columns
val df1 = Seq(1L, 2L).toDF("x")
val df2 = Seq(Some(3L), None).toDF("x")

df1.printSchema
// root
//  |-- x: long (nullable = false)

df2.printSchema
// root
//  |-- x: long (nullable = true)

(df1 union df2).printSchema
// root
//  |-- x: long (nullable = true)

(df1 union df2).as[X].collect
// res19: Array[X] = Array(X(Some(1)), X(Some(2)), X(Some(3)), X(None))

(df1 union df2).select("*").show
// +----+
// |   x|
// +----+
// |   1|
// |   2|
// |   3|
// |null|
// +----+

// Now, the same with the 'x' column within a struct:

val struct1 = df1.select(struct('x) as "nested")
val struct2 = df2.select(struct('x) as "nested")

struct1.printSchema
// root
//  |-- nested: struct (nullable = false)
//  |    |-- x: long (nullable = false)

struct2.printSchema
// root
//  |-- nested: struct (nullable = false)
//  |    |-- x: long (nullable = true)

// BAD: the x column is not nullable
(struct1 union struct2).printSchema
// root
//  |-- nested: struct (nullable = false)
//  |    |-- x: long (nullable = false)

// BAD: the last x value became "Some(0)", instead of "None"
(struct1 union struct2).as[Nested].collect
// res23: Array[Nested] = Array(Nested(X(Some(1))), Nested(X(Some(2))), Nested(X(Some(3))), Nested(X(Some(0))))

// BAD: showing just the nested columns throws an NPE
(struct1 union struct2).select("nested.*").show
// java.lang.NullPointerException
//  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
//  at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collectFromPlan$1.apply(Dataset.scala:3387)
// ...
//  at org.apache.spark.sql.Dataset.show(Dataset.scala:714)
//  ... 49 elided


// Flipping the order makes x nullable as desired
(struct2 union struct1).printSchema
// root
//  |-- nested: struct (nullable = false)
//  |    |-- x: long (nullable = true)
(struct2 union struct1).as[Nested].collect
// res26: Array[Nested] = Array(Nested(X(Some(3))), Nested(X(None)), Nested(X(Some(1))), Nested(X(Some(2))))

(struct2 union struct1).select("nested.*").show
// +----+
// |   x|
// +----+
// |   3|
// |null|
// |   1|
// |   2|
// +----+

Note the three BAD lines, where the union of structs became non-nullable and resulted in invalid values and exceptions.
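
A possible workaround, not part of the original report and not verified against every affected version, is to make both inputs fully nullable before the union so that the result no longer depends on which side comes first. The sketch below uses only public Spark APIs (`StructType`, `ArrayType`, `MapType`, `SparkSession.createDataFrame(RDD[Row], StructType)`); the names `ForceNullable` and `relax` are hypothetical helpers introduced here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType}

// Hypothetical helper: rebuilds a DataFrame with every nested field marked nullable.
object ForceNullable {
  // Recursively mark struct fields, array elements and map values as nullable.
  def relax(dt: DataType): DataType = dt match {
    case StructType(fields) =>
      StructType(fields.map(f => f.copy(dataType = relax(f.dataType), nullable = true)))
    case ArrayType(elem, _) => ArrayType(relax(elem), containsNull = true)
    case MapType(k, v, _) => MapType(relax(k), relax(v), valueContainsNull = true)
    case other => other
  }

  // Re-create the DataFrame from its rows using the relaxed schema.
  def apply(spark: SparkSession, df: DataFrame): DataFrame =
    spark.createDataFrame(df.rdd, relax(df.schema).asInstanceOf[StructType])
}
```

In the shell session above this would be used as `ForceNullable(spark, struct1) union ForceNullable(spark, struct2)`. Going through `df.rdd` converts every row to an external `Row` object, so this trades some performance for a schema that is safe in either order; putting the nullable DataFrame first, as in the last example above, avoids that cost when the order is known.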

Issue Links

duplicates: SPARK-26812 PushProjectionThroughUnion nullability issue (Resolved)

People

Assignee: Unassigned
Reporter: Huon Wilson
Votes: 0
Watchers: 3

Dates

Created: 13/May/19 04:45
Updated: 02/Mar/20 20:20
Resolved: 14/May/19 10:20