Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Incomplete
Affects Version/s: 2.0.0
Fix Version/s: None
Description
It seems the data sources implementing FileFormat load data by forcing all fields to be nullable. This behavior was officially documented in SPARK-11360 and was discussed here: https://www.mail-archive.com/user@spark.apache.org/msg39230.html
However, I realised that several APIs do not follow this. For example:

DataFrameReader.json(jsonRDD: RDD[String])
So, the code below:
import org.apache.spark.sql.types._

val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : null}"))
val schema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)
val df = spark.read.schema(schema).json(rdd)
df.printSchema()
prints:
root
|-- a: integer (nullable = false)
This API preserves the given schema as-is after loading. However, the schema becomes different (the fields turn nullable) when the same data is loaded via the APIs below:
spark.read.format("json").schema(...).load(path).printSchema()
spark.read.schema(...).load(path).printSchema()
which produce:
root
|-- a: integer (nullable = true)
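For reference, a minimal sketch of the file-based reproduction (the /tmp path is illustrative and assumed not to exist yet; the sample records are the same as above):

import org.apache.spark.sql.types._

// Write the same sample records to disk so they go through the
// FileFormat-based read path.
val path = "/tmp/nullable-repro"  // illustrative path
spark.sparkContext
  .makeRDD(Seq("{\"a\" : 1}", "{\"a\" : null}"))
  .saveAsTextFile(path)

val schema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)

// The user-specified nullable = false is silently dropped here:
spark.read.schema(schema).json(path).printSchema()
// root
//  |-- a: integer (nullable = true)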
In addition, this happens with Structured Streaming as well (even when reading back in batch the data that was written by Structured Streaming).
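A sketch of the Structured Streaming case (the paths are illustrative; parquet is used as a supported file sink, and a checkpoint location is required for file sinks):

import org.apache.spark.sql.types._

val schema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)

// Stream JSON files in with a non-nullable schema and write them out as parquet.
val query = spark.readStream
  .schema(schema)
  .json("/tmp/stream-in")                          // illustrative input directory
  .writeStream
  .format("parquet")
  .option("path", "/tmp/stream-out")               // illustrative output directory
  .option("checkpointLocation", "/tmp/stream-chk") // required for file sinks
  .start()

query.processAllAvailable()
query.stop()

// Reading the streamed output back in batch still yields a nullable field,
// even when the same schema is supplied again:
spark.read.schema(schema).parquet("/tmp/stream-out").printSchema()
// root
//  |-- a: integer (nullable = true)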
While testing, I wrote some test code and patches. Please see the following PR for more cases.
Attachments
Issue Links

is duplicated by:
- SPARK-18270 Users schema with non-nullable properties is overidden with true (Resolved)
- SPARK-27233 Schema of ArrayType change after saveAsTable and read (Resolved)
- SPARK-27559 Nullable in a given schema is not respected when reading from parquet (Resolved)