  1. Spark
  2. SPARK-29575

from_json can produce nulls for fields which are marked as non-nullable

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.4.4
    • Fix Version/s: None
    • Component/s: PySpark, SQL
    • Labels: None

    Description

      I believe this issue was resolved elsewhere (https://issues.apache.org/jira/browse/SPARK-23173), but in PySpark the bug still seems to be present.

      The issue appears when using from_json to parse a JSON string column in a Spark DataFrame. from_json seems to ignore nullable=False on the fields of the supplied schema: a field that is missing from the input comes back as null even though it is marked as non-nullable.

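      from pyspark.sql import functions as F, types as T  # `spark` and `sc` come from the PySpark shell

      # Both fields are declared non-nullable.
      schema = T.StructType().add(T.StructField('id', T.LongType(), nullable=False)).add(T.StructField('name', T.StringType(), nullable=False))
      # The second record has no 'id', yet from_json returns it with id = null.
      data = [{'user': str({'name': 'joe', 'id':1})}, {'user': str({'name': 'jane'})}]
      df = spark.read.json(sc.parallelize(data))
      df.withColumn("details", F.from_json("user", schema)).select("details.*").show()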
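
Since the ticket was resolved as Not A Problem, callers who rely on the non-null guarantee may need to enforce it themselves after parsing. The following is only a minimal sketch of one way to do that, assuming that rows with a missing required field should simply be dropped. `required_fields` and `non_null_condition` are hypothetical helper names introduced here for illustration, not part of Spark's API; the only Spark calls assumed are `StructType.jsonValue()` and `DataFrame.filter` with a SQL string.

```python
def required_fields(schema_json):
    """Return the names of top-level fields declared with nullable=False.

    `schema_json` is the dict form of a StructType, as returned by
    StructType.jsonValue().
    """
    return [f["name"] for f in schema_json["fields"] if not f["nullable"]]


def non_null_condition(column, names):
    """Build a SQL predicate requiring every named subfield of `column` to be non-null.

    Falls back to the always-true predicate when no field is required.
    """
    return " AND ".join(f"{column}.`{n}` IS NOT NULL" for n in names) or "true"
```

With the `df`, `schema` and `F` from the report above, the helpers could be used roughly like this:

```python
parsed = df.withColumn("details", F.from_json("user", schema))
cond = non_null_condition("details", required_fields(schema.jsonValue()))
parsed.filter(cond).select("details.*").show()
```

Filtering is one possible policy; raising an error when any row violates the predicate would be another.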
      

       

          People

            Assignee: Unassigned
            Reporter: Victor Lopez (lo_p_ez)
            Votes: 0
            Watchers: 2
