When taking the union of two dataframes, a column that is nullable in either dataframe is nullable in the union: the non-nullable side is promoted to nullable.
This promotion does not happen for columns nested inside a struct. The union seems to just take the nullability of the first dataframe, so a nullable nested column can become non-nullable, resulting in invalid values.
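// Run in spark-shell, which already imports spark.implicits._ and org.apache.spark.sql.functions._
case class X(x: Option[Long])
case class Nested(nested: X)
// Top-level columns: nullability is promoted as expected.
val df1 = Seq(1L, 2L).toDF("x")           // x is non-nullable
val df2 = Seq(Some(3L), None).toDF("x")   // x is nullable
(df1 union df2).printSchema
(df1 union df2).as[X].collect
(df1 union df2).select("*").show
// The same columns wrapped in a struct.
val struct1 = df1.select(struct('x) as "nested")
val struct2 = df2.select(struct('x) as "nested")
// Non-nullable side first: nested.x takes struct1's nullability.
(struct1 union struct2).printSchema                 // BAD
(struct1 union struct2).as[Nested].collect          // BAD
(struct1 union struct2).select("nested.*").show     // BAD
// Nullable side first.
(struct2 union struct1).printSchema
(struct2 union struct1).as[Nested].collect
(struct2 union struct1).select("nested.*").show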
Note the three lines that operate on (struct1 union struct2): there the nested column x became non-nullable in the union, which resulted in invalid values and exceptions.
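Until this is fixed, one possible workaround is to give both sides the same schema before the union, with every nested nullable flag OR-ed together. The sketch below is only an illustration under stated assumptions: NullabilityMerge, mergeStruct and mergeType are hypothetical names and not part of Spark, both schemas are assumed to have the same shape, and arrays and maps are not handled.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper, not part of Spark.
object NullabilityMerge {
  // OR the nullable flags of two struct types field by field, recursing into nested structs.
  def mergeStruct(left: StructType, right: StructType): StructType = {
    require(left.length == right.length, "both structs must have the same number of fields")
    StructType(left.fields.zip(right.fields).map { case (lf, rf) =>
      // union resolves columns by position, so the left field's name and metadata are kept.
      lf.copy(
        dataType = mergeType(lf.dataType, rf.dataType),
        nullable = lf.nullable || rf.nullable)
    })
  }

  // Only struct types are merged recursively; any other type is returned unchanged.
  private def mergeType(left: DataType, right: DataType): DataType = (left, right) match {
    case (ls: StructType, rs: StructType) => mergeStruct(ls, rs)
    case _ => left
  }
}
```

In spark-shell, the merged schema could then be applied to both sides with spark.createDataFrame(struct1.rdd, merged) and spark.createDataFrame(struct2.rdd, merged), where merged = NullabilityMerge.mergeStruct(struct1.schema, struct2.schema). The union of the two rebuilt dataframes should then report nested.x as nullable whichever side comes first, though this has not been verified against every affected version.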