Details
Improvement
Status: Resolved
Minor
Resolution: Incomplete
2.3.0
None
Description
Currently, the CSV and JSON datasources always set the nullable flag to true during schema inference, regardless of the data itself.
JSON: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala#L126
CSV: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L51
For example, suppose the source dataset has the schema:
root
 |-- item_id: integer (nullable = false)
 |-- country: string (nullable = false)
 |-- state: string (nullable = false)
If we save it and read it back, the schema of the inferred dataset is:
root
 |-- item_id: integer (nullable = true)
 |-- country: string (nullable = true)
 |-- state: string (nullable = true)
This ticket aims to set the nullable flag more precisely during schema inference, based on the data actually read.
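To illustrate the intended behavior, here is a minimal sketch in plain Python. It is not Spark's implementation; the function name `infer_nullable` and the sample rows are hypothetical. It assumes each input line is a JSON object and marks a field as nullable only if some record omits it or stores an explicit null.

```python
import json


def infer_nullable(json_lines):
    """Return {field_name: nullable} for a sequence of JSON-lines records.

    Hypothetical helper for illustration only; assumes every non-blank
    line parses to a JSON object.
    """
    records = [json.loads(line) for line in json_lines if line.strip()]

    # Collect every field name seen in any record, in first-seen order.
    fields = {}
    for record in records:
        for name in record:
            fields.setdefault(name, False)

    # A field is nullable if some record omits it or holds an explicit null.
    for name in fields:
        fields[name] = any(record.get(name) is None for record in records)
    return fields


# Hypothetical sample data: "state" is null once, "country" is missing once.
rows = [
    '{"item_id": 1, "country": "US", "state": "CA"}',
    '{"item_id": 2, "country": "DE", "state": null}',
    '{"item_id": 3, "state": "NY"}',
]
nullable = infer_nullable(rows)
```

With this rule, `item_id` stays non-nullable because every record carries a value for it, while `country` and `state` become nullable. Note that a flag inferred this way only describes the data that was scanned, so it would be less reliable if inference ran on a sample of the input.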