  1. Spark
  2. SPARK-23173

from_json can produce nulls for fields which are marked as non-nullable


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.1
    • Fix Version/s: 2.3.1, 2.4.0
    • Component/s: SQL

    Description

      The from_json function uses a schema to convert a string into a Spark SQL struct. This schema can contain non-nullable fields. The underlying JsonToStructs expression does not check whether the resulting struct respects the nullability declared in the schema. This leads to hard-to-diagnose problems in consuming expressions, which trust the declared nullability. In our case, Parquet writing would produce an illegal Parquet file.

      There are roughly two solutions here:

      1. Assume that every field in the schema passed to from_json is nullable, and ignore the nullability information set in that schema.
      2. Validate the object at runtime, and fail execution if the data is null where we do not expect it.

      I am currently slightly in favor of option 1, since it is the more performant option and a lot easier to implement.
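
      To make the trade-off concrete, here is a minimal plain-Python sketch of the two options. It is only a model of the idea, not Spark code: Field, parse_row, as_nullable and validate are hypothetical names invented for this illustration, and the real JsonToStructs expression works quite differently.

```python
import json
from dataclasses import dataclass, replace
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Field:
    # Hypothetical stand-in for a schema field; this is not Spark's StructField.
    name: str
    nullable: bool = True


def parse_row(text: str, schema: List[Field]) -> Optional[Dict[str, Any]]:
    """Parse a JSON object into a row; fields missing from the input become None."""
    try:
        obj = json.loads(text)
    except ValueError:
        return None  # malformed input yields a null row
    if not isinstance(obj, dict):
        return None
    # Like the behavior described above, nothing here checks nullability.
    return {f.name: obj.get(f.name) for f in schema}


def as_nullable(schema: List[Field]) -> List[Field]:
    """Option 1: ignore the declared nullability and treat every field as nullable."""
    return [replace(f, nullable=True) for f in schema]


def validate(row: Optional[Dict[str, Any]], schema: List[Field]) -> Optional[Dict[str, Any]]:
    """Option 2: fail at runtime when a non-nullable field holds null."""
    if row is None:
        return None
    for f in schema:
        if not f.nullable and row[f.name] is None:
            raise ValueError(f"null value in non-nullable field '{f.name}'")
    return row


schema = [Field("a", nullable=False), Field("b")]
row = parse_row('{"b": 1}', schema)   # {'a': None, 'b': 1}: violates the schema

relaxed = as_nullable(schema)         # option 1: the row is now consistent with the schema
try:
    validate(row, schema)             # option 2: the inconsistency is reported instead
except ValueError as err:
    print(err)
```

      In this model, option 1 is a one-time rewrite of the schema, while option 2 adds a check for every parsed row, which matches the performance argument above.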

      WDYT? cc Reynold Xin Michael Armbrust Hyukjin Kwon Burak Yavuz

          People

            Assignee: Michał Świtakowski (mswit)
            Reporter: Herman van Hövell (hvanhovell)
            Votes: 0
            Watchers: 9
