SPARK-16472: Inconsistent nullability in schema after being read


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: SQL

Description

      It seems the data sources implementing FileFormat load the data by forcing the fields to be nullable. It seems this behaviour was officially documented in SPARK-11360 and was discussed here: https://www.mail-archive.com/user@spark.apache.org/msg39230.html
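
      For illustration, a minimal sketch of that documented behaviour in a spark-shell session (the output path is hypothetical): a field that is non-nullable before writing comes back nullable after a round trip through a FileFormat source such as Parquet.

      // "a" is non-nullable before writing.
      val df = spark.range(3).toDF("a")
      df.printSchema()    // |-- a: long (nullable = false)
      df.write.mode("overwrite").parquet("/tmp/nullable-test")

      // The FileFormat-based read forces the field to be nullable.
      spark.read.parquet("/tmp/nullable-test").printSchema()
      // root
      //  |-- a: long (nullable = true)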

      However, I realised that several APIs do not follow this. For example,

      DataFrameReader.json(jsonRDD: RDD[String])
      

      So, the code below:

      import org.apache.spark.sql.types._

      val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : null}"))
      val schema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)
      val df = spark.read.schema(schema).json(rdd)
      df.printSchema()
      

      prints:

      root
       |-- a: integer (nullable = false)
      

      This API keeps the user-specified schema as it is after loading. However, the schema becomes different (all fields forced nullable) when loading via the path-based APIs below:

      spark.read.format("json").schema(...).load(path).printSchema()
      
      spark.read.schema(...).load(path).printSchema()
      

      Both produce:

      root
       |-- a: integer (nullable = true)
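
      To make the comparison concrete, here is a runnable sketch of the load path (same records as above; the path is hypothetical, and saveAsTextFile fails if the directory already exists):

      import org.apache.spark.sql.types._

      val schema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)
      val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : null}"))
      rdd.saveAsTextFile("/tmp/json-nullable")

      // The path-based read ignores the nullability in the user-specified schema.
      spark.read.schema(schema).json("/tmp/json-nullable").printSchema()
      // root
      //  |-- a: integer (nullable = true)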
      

      In addition, this is happening for Structured Streaming as well (even when we read the data back as a batch after writing it with Structured Streaming); a minimal streaming sketch follows.
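
      A sketch of the streaming read path, assuming the same hypothetical input directory; the user-specified non-nullable field comes back nullable here too:

      import org.apache.spark.sql.types._

      val schema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)
      val streamDf = spark.readStream.schema(schema).json("/tmp/json-nullable")
      streamDf.printSchema()
      // root
      //  |-- a: integer (nullable = true)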

      While testing, I wrote some test code and patches. Please see the linked PR for more cases.

Attachments

Issue Links

Activity

People

    Assignee: Unassigned
    Reporter: Hyukjin Kwon (gurwls223)
    Votes: 1
    Watchers: 3
