[SPARK-21768] spark.csv.read Empty String Parsed as NULL when nullValue is Set - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 2.0.2, 2.2.0
Fix Version/s: None
Component/s: PySpark, SQL
Labels: None
Environment: AWS EMR Spark 2.2.0 (also Spark 2.0.2), PySpark

Description

When reading a CSV with quoted fields, an empty quoted string ("") is parsed as NULL even when a different nullValue is explicitly set.

Example CSV with quoted fields, delimiter | and nullValue XXNULLXX:

"XXNULLXX"|""|"XXNULLXX"|"foo"

PySpark Script to load the file (from S3):

load.py

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("test_csv").getOrCreate()

# Explicit schema: four nullable string columns, named after the value each should hold.
fields = []
fields.append(StructField("First Null Field", StringType(), True))
fields.append(StructField("Empty String Field", StringType(), True))
fields.append(StructField("Second Null Field", StringType(), True))
fields.append(StructField("Non Empty String Field", StringType(), True))
schema = StructType(fields)

# Path(s) to the example CSV shown above.
keys = ['s3://mybucket/test/demo.csv']

# Read with "|" as the delimiter and "XXNULLXX" as the only declared null marker.
bad_data = spark.read.csv(keys, timestampFormat="yyyy-MM-dd HH:mm:ss", mode="FAILFAST", sep="|", nullValue="XXNULLXX", schema=schema)
bad_data.show()

Actual Output:

+----------------+------------------+-----------------+----------------------+
|First Null Field|Empty String Field|Second Null Field|Non Empty String Field|
+----------------+------------------+-----------------+----------------------+
|            null|              null|             null|                   foo|
+----------------+------------------+-----------------+----------------------+

Expected Output:

+----------------+------------------+-----------------+----------------------+
|First Null Field|Empty String Field|Second Null Field|Non Empty String Field|
+----------------+------------------+-----------------+----------------------+
|            null|                  |             null|                   foo|
+----------------+------------------+-----------------+----------------------+
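
The expected semantics can be illustrated outside Spark. The following is a minimal sketch using Python's standard csv module, not Spark's parser: only the configured sentinel becomes a null, while a quoted empty string stays an empty string. The helper name parse_line and its parameters are hypothetical and exist only for this illustration.

```python
import csv
import io

NULL_VALUE = "XXNULLXX"  # sentinel taken from the example CSV in this report


def parse_line(line, sep="|", null_value=NULL_VALUE):
    """Hypothetical helper: parse one CSV line, mapping only the sentinel to None."""
    # csv.reader strips the surrounding quotes, so "" yields an empty string.
    row = next(csv.reader(io.StringIO(line), delimiter=sep, quotechar='"'))
    # Only fields equal to the sentinel become None; empty strings are preserved.
    return [None if field == null_value else field for field in row]


print(parse_line('"XXNULLXX"|""|"XXNULLXX"|"foo"'))
# prints [None, '', None, 'foo']
```

The second field comes back as '' rather than None, which matches the Expected Output table above.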

Issue Links

duplicates: SPARK-17916 CSV data source treats empty string as null no matter what nullValue option is (Resolved)

People

Assignee: Unassigned

Reporter: Andrew Gross

Votes: 0

Watchers: 4

Dates

Created: 17/Aug/17 20:04

Updated: 18/Aug/17 13:40

Resolved: 18/Aug/17 13:40