Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-21768

spark.csv.read Empty String Parsed as NULL when nullValue is Set

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • 2.0.2, 2.2.0
    • None
    • PySpark, SQL
    • None
    • AWS EMR Spark 2.2.0 (also Spark 2.0.2)
      PySpark

    Description

      In a CSV with quoted fields, empty strings will be interpreted as NULL even when a nullValue is explicitly set:

      Example CSV with Quoted Fields, Delimiter | and nullValue XXNULLXX

      "XXNULLXX"|""|"XXNULLXX"|"foo"

      PySpark Script to load the file (from S3):

      load.py
      from pyspark.sql import SparkSession
      from pyspark.sql.types import StringType, StructField, StructType
      
      spark = SparkSession.builder.appName("test_csv").getOrCreate()
      
      fields = []
      fields.append(StructField("First Null Field", StringType(), True))
      fields.append(StructField("Empty String Field", StringType(), True))
      fields.append(StructField("Second Null Field", StringType(), True))
      fields.append(StructField("Non Empty String Field", StringType(), True))
      schema = StructType(fields)
      
      keys = ['s3://mybucket/test/demo.csv']
      
      bad_data = spark.read.csv(keys, timestampFormat="yyyy-MM-dd HH:mm:ss", mode="FAILFAST", sep="|", nullValue="XXNULLXX", schema=schema)
      bad_data.show()
      

      Output

      +----------------+------------------+-----------------+----------------------+
      |First Null Field|Empty String Field|Second Null Field|Non Empty String Field|
      +----------------+------------------+-----------------+----------------------+
      |            null|              null|             null|                   foo|
      +----------------+------------------+-----------------+----------------------+
      

      Expected Output:

      +----------------+------------------+-----------------+----------------------+
      |First Null Field|Empty String Field|Second Null Field|Non Empty String Field|
      +----------------+------------------+-----------------+----------------------+
      |            null|                  |             null|                   foo|
      +----------------+------------------+-----------------+----------------------+
      

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              andrew.w.gross@gmail.com Andrew Gross
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: