Details
Description
I'm loading a CSV file into a DataFrame using Spark. I have defined a schema and specified its fields as non-nullable.
With the mode set to DROPMALFORMED, I expect any row with a missing (null) value in those columns to be dropped in its entirety. At the moment, the CSV loader correctly drops rows that do not conform to the field type, but the nullable property is seemingly ignored.
Example CSV input:
1,2,3
1,,3
,2,3
1,2,abc
Example Spark job:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val spark = SparkSession
  .builder()
  .appName("csv-test")
  .master("local")
  .getOrCreate()

val schema = StructType(
  StructField("col1", IntegerType, nullable = false) ::
  StructField("col2", IntegerType, nullable = false) ::
  StructField("col3", IntegerType, nullable = false) :: Nil)

spark.read
  .format("csv")
  .schema(schema)
  .option("header", false)
  .option("mode", "DROPMALFORMED")
  .load("path/to/file.csv")
  .coalesce(1)
  .write
  .format("csv")
  .option("header", false)
  .save("path/to/output")
The actual output will be:
1,2,3
1,,3
,2,3
Note that the row containing non-integer values has been dropped, as expected, but rows containing null values persist, despite the nullable property being set to false in the schema definition.
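A quick check (a sketch reusing the spark session and schema val from the job above) is to print the schema Spark actually applied to the loaded DataFrame; the reader appears to report every field as nullable = true, i.e. the flags in the user-supplied schema are discarded on load:

val df = spark.read
  .format("csv")
  .schema(schema)
  .option("header", false)
  .option("mode", "DROPMALFORMED")
  .load("path/to/file.csv")

// If the reader discarded the nullable = false flags, every field
// prints here as nullable = true.
df.printSchema()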
My expected output is:
1,2,3
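Until the reader honors the flag, a possible workaround (a sketch only, not a fix; it reuses the df from the check above) is to enforce the constraint manually with na.drop, which removes any row holding a null in the listed columns:

// Workaround sketch: manually drop rows with nulls in the columns the
// schema declares non-nullable, since the reader does not.
val cleaned = df.na.drop(Seq("col1", "col2", "col3"))

Writing cleaned with the same write options as above should then emit only the 1,2,3 row.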
Issue Links
- duplicates: SPARK-20457 Spark CSV is not able to Override Schema while reading data (Resolved)
- is related to: SPARK-10848 Applied JSON Schema Works for json RDD but not when loading json file (Resolved)