Details
- Bug
- Status: Closed
- Major
- Resolution: Duplicate
- 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.0
- None
Description
When doing a union of two dataframes, a column that is nullable in one of the dataframes will be nullable in the union, promoting the non-nullable one to be nullable.
This promotion does not happen for columns nested inside a struct. The union appears to take the nullability of the first dataframe only, so a nullable nested column can become non-nullable, which produces invalid values.
// Run in spark-shell, which pre-imports spark.implicits._ and org.apache.spark.sql.functions._
case class X(x: Option[Long])
case class Nested(nested: X)

// First, just work with normal columns
val df1 = Seq(1L, 2L).toDF("x")
val df2 = Seq(Some(3L), None).toDF("x")

df1.printSchema
// root
//  |-- x: long (nullable = false)

df2.printSchema
// root
//  |-- x: long (nullable = true)

(df1 union df2).printSchema
// root
//  |-- x: long (nullable = true)

(df1 union df2).as[X].collect
// res19: Array[X] = Array(X(Some(1)), X(Some(2)), X(Some(3)), X(None))

(df1 union df2).select("*").show
// +----+
// |   x|
// +----+
// |   1|
// |   2|
// |   3|
// |null|
// +----+

// Now, the same with the 'x' column within a struct:
val struct1 = df1.select(struct('x) as "nested")
val struct2 = df2.select(struct('x) as "nested")

struct1.printSchema
// root
//  |-- nested: struct (nullable = false)
//  |    |-- x: long (nullable = false)

struct2.printSchema
// root
//  |-- nested: struct (nullable = false)
//  |    |-- x: long (nullable = true)

// BAD: the x column is not nullable
(struct1 union struct2).printSchema
// root
//  |-- nested: struct (nullable = false)
//  |    |-- x: long (nullable = false)

// BAD: the last x value became "Some(0)", instead of "None"
(struct1 union struct2).as[Nested].collect
// res23: Array[Nested] = Array(Nested(X(Some(1))), Nested(X(Some(2))), Nested(X(Some(3))), Nested(X(Some(0))))

// BAD: showing just the nested columns throws a NPE
(struct1 union struct2).select("nested.*").show
// java.lang.NullPointerException
//   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
//   at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collectFromPlan$1.apply(Dataset.scala:3387)
//   ...
//   at org.apache.spark.sql.Dataset.show(Dataset.scala:714)
//   ... 49 elided

// Flipping the order makes x nullable as desired
(struct2 union struct1).printSchema
// root
//  |-- nested: struct (nullable = false)
//  |    |-- x: long (nullable = true)

(struct2 union struct1).as[Nested].collect
// res26: Array[Nested] = Array(Nested(X(Some(3))), Nested(X(None)), Nested(X(Some(1))), Nested(X(Some(2))))

(struct2 union struct1).select("nested.*").show
// +----+
// |   x|
// +----+
// |   3|
// |null|
// |   1|
// |   2|
// +----+
Note the three BAD lines, where the union of structs became non-nullable and resulted in invalid values and exceptions.
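For illustration only, the expected behaviour (a nested field is nullable in the union if it is nullable on either side) can be sketched as a recursive merge over the two schemas. This is not the fix that went into Spark. The names NullabilityMerge, mergeNullability and unionWithMergedNullability are hypothetical helpers, and the workaround of re-applying the merged schema through createDataFrame has not been verified against the affected versions; it also goes through the RDD API, which has a performance cost.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

// Hypothetical helper object, not part of Spark.
object NullabilityMerge {

  // Recursively OR together the nullability flags of two types that are
  // assumed to have the same shape. Where the shapes differ, the left type
  // is returned unchanged.
  def mergeNullability(left: DataType, right: DataType): DataType = (left, right) match {
    case (StructType(lFields), StructType(rFields)) if lFields.length == rFields.length =>
      StructType(lFields.zip(rFields).map { case (l, r) =>
        l.copy(
          dataType = mergeNullability(l.dataType, r.dataType),
          nullable = l.nullable || r.nullable)
      })
    case (ArrayType(lElem, lNull), ArrayType(rElem, rNull)) =>
      ArrayType(mergeNullability(lElem, rElem), lNull || rNull)
    case (MapType(lKey, lVal, lNull), MapType(rKey, rVal, rNull)) =>
      MapType(mergeNullability(lKey, rKey), mergeNullability(lVal, rVal), lNull || rNull)
    case _ => left
  }

  // Possible workaround (an assumption, untested on the affected versions):
  // give both inputs the same merged schema before the union, so the result
  // no longer depends on which side comes first.
  def unionWithMergedNullability(spark: SparkSession, a: DataFrame, b: DataFrame): DataFrame = {
    val merged = mergeNullability(a.schema, b.schema).asInstanceOf[StructType]
    spark.createDataFrame(a.rdd, merged).union(spark.createDataFrame(b.rdd, merged))
  }
}
```

With the dataframes from the description, NullabilityMerge.unionWithMergedNullability(spark, struct1, struct2) is intended to yield a schema in which nested.x is nullable regardless of argument order.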
Issue Links
- duplicates SPARK-26812 PushProjectionThroughUnion nullability issue (Resolved)