Spark / SPARK-24269

Infer nullability rather than declaring all columns as nullable


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      Currently, the CSV and JSON data sources set the nullable flag to true for every column during schema inference, regardless of the data itself.

      JSON: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala#L126
      CSV: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L51

      For example, a source dataset has the schema:

      root
       |-- item_id: integer (nullable = false)
       |-- country: string (nullable = false)
       |-- state: string (nullable = false)
      

      If we save it and read it back, the inferred schema becomes:

      root
       |-- item_id: integer (nullable = true)
       |-- country: string (nullable = true)
       |-- state: string (nullable = true)
      

      This ticket aims to set the nullable flag more precisely during schema inference, based on the data actually read.
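      The proposed behavior can be sketched in plain Python (a hypothetical helper for illustration, not Spark's actual inference code): scan each column's values and mark a column nullable only if a null is actually observed, instead of unconditionally declaring every column nullable.

      ```python
      def infer_nullability(rows, columns):
          # Hypothetical sketch of the behavior this ticket proposes:
          # start from nullable=False and flip to True only when a
          # null value is actually seen in the data.
          nullable = {col: False for col in columns}
          for row in rows:
              for col in columns:
                  if row.get(col) is None:
                      nullable[col] = True
          return nullable

      rows = [
          {"item_id": 1, "country": "US", "state": "CA"},
          {"item_id": 2, "country": "DE", "state": None},
      ]
      print(infer_nullability(rows, ["item_id", "country", "state"]))
      # item_id and country stay non-nullable; state becomes nullable
      ```

      With this approach, the round-trip example above would preserve nullable = false for columns that contain no nulls, at the cost of one extra pass over the sampled data.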


          People

            Assignee: Unassigned
            Reporter: Max Gekk
            Votes: 2
            Watchers: 4
