Description
Consider the following view, with all fields non-nullable (required):
spark.sql("""
CREATE OR REPLACE VIEW v2 AS
SELECT id, named_struct('a', id) AS nested
FROM RANGE(10)
""")
We can see that the view schema has been stored correctly, with all fields non-nullable:
scala> System.out.println(spark.sessionState.catalog.externalCatalog.getTable("default", "v2"))
CatalogTable(
Database: default
Table: v2
Owner: smahadik
Created Time: Tue Dec 07 09:00:42 PST 2021
Last Access: UNKNOWN
Created By: Spark 3.3.0-SNAPSHOT
Type: VIEW
View Text: SELECT id, named_struct('a', id) AS nested FROM RANGE(10)
View Original Text: SELECT id, named_struct('a', id) AS nested FROM RANGE(10)
View Catalog and Namespace: spark_catalog.default
View Query Output Columns: [id, nested]
Table Properties: [transient_lastDdlTime=1638896442]
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties: [serialization.format=1]
Schema: root
 |-- id: long (nullable = false)
 |-- nested: struct (nullable = false)
 |    |-- a: long (nullable = false)
)
However, when reading this view back, Spark incorrectly marks the nested field a as nullable:
scala> spark.table("v2").printSchema
root
 |-- id: long (nullable = false)
 |-- nested: struct (nullable = false)
 |    |-- a: long (nullable = true)
This is caused by this line in Analyzer.scala. Going through the history of changes for this block of code, asNullable appears to be a remnant of a time before checks were added to ensure that the from and to types of the cast are compatible. Since nullability is already checked there, it should be safe to add the cast without converting the target data type to nullable.
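For context, here is a minimal sketch of what asNullable does. It is a Spark-internal helper on DataType, so makeNullable below is a hypothetical re-implementation for illustration only: it recursively marks every nested field, array element, and map value as nullable. The top-level column in the printed schema keeps nullable = false because Cast derives its own nullability from the child expression, but any nested fields take their nullability from the fully-nullable target type, which matches the output shown above.

import org.apache.spark.sql.types._

// Hypothetical re-implementation of DataType.asNullable, for illustration:
// recursively mark every nested field, array element, and map value nullable.
def makeNullable(dt: DataType): DataType = dt match {
  case StructType(fields) =>
    StructType(fields.map(f =>
      f.copy(dataType = makeNullable(f.dataType), nullable = true)))
  case ArrayType(elementType, _) =>
    ArrayType(makeNullable(elementType), containsNull = true)
  case MapType(keyType, valueType, _) =>
    MapType(makeNullable(keyType), makeNullable(valueType), valueContainsNull = true)
  case other => other
}

// The stored view schema from the catalog output above.
val viewSchema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("nested", StructType(Seq(
    StructField("a", LongType, nullable = false))), nullable = false)))

// Prints a tree where every field, including nested.a, is nullable = true.
println(makeNullable(viewSchema).asInstanceOf[StructType].treeString)

Under that reading, dropping asNullable and casting to the expected data type directly would preserve the stored nullability of nested fields, which is the change this description proposes.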