Details
Description
In a CSV with quoted fields, empty strings will be interpreted as NULL even when a nullValue is explicitly set:
Example CSV with Quoted Fields, Delimiter | and nullValue XXNULLXX
"XXNULLXX"|""|"XXNULLXX"|"foo"
PySpark Script to load the file (from S3):
load.py
from pyspark.sql import SparkSession from pyspark.sql.types import StringType, StructField, StructType spark = SparkSession.builder.appName("test_csv").getOrCreate() fields = [] fields.append(StructField("First Null Field", StringType(), True)) fields.append(StructField("Empty String Field", StringType(), True)) fields.append(StructField("Second Null Field", StringType(), True)) fields.append(StructField("Non Empty String Field", StringType(), True)) schema = StructType(fields) keys = ['s3://mybucket/test/demo.csv'] bad_data = spark.read.csv(keys, timestampFormat="yyyy-MM-dd HH:mm:ss", mode="FAILFAST", sep="|", nullValue="XXNULLXX", schema=schema) bad_data.show()
Output
+----------------+------------------+-----------------+----------------------+ |First Null Field|Empty String Field|Second Null Field|Non Empty String Field| +----------------+------------------+-----------------+----------------------+ | null| null| null| foo| +----------------+------------------+-----------------+----------------------+
Expected Output:
+----------------+------------------+-----------------+----------------------+ |First Null Field|Empty String Field|Second Null Field|Non Empty String Field| +----------------+------------------+-----------------+----------------------+ | null| | null| foo| +----------------+------------------+-----------------+----------------------+
Attachments
Issue Links
- duplicates
-
SPARK-17916 CSV data source treats empty string as null no matter what nullValue option is
- Resolved