Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 2.3.2, 2.4.3
- Fix Version/s: None
- Component/s: None
Description
When reading a CSV with mode = "PERMISSIVE", corrupt records are not flagged as such and are silently read in. The only way to get them flagged is to manually set "columnNameOfCorruptRecord" AND to manually set a schema that includes this column. Example:
import java.io.File
import org.apache.commons.io.FileUtils

// Second row has a 4th column that is not declared in the header/schema
val csvText = s"""
  | FieldA, FieldB, FieldC
  | a1,b1,c1
  | a2,b2,c2,d*""".stripMargin
val csvFile = new File("/tmp/file.csv")
FileUtils.write(csvFile, csvText)
val reader = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "corrupt")
  .schema("corrupt STRING, fieldA STRING, fieldB STRING, fieldC STRING")
reader.load(csvFile.getAbsolutePath).show(truncate = false)
This produces the correct result:
+------------+------+------+------+
|corrupt     |fieldA|fieldB|fieldC|
+------------+------+------+------+
|null        | a1   |b1    |c1    |
| a2,b2,c2,d*| a2   |b2    |c2    |
+------------+------+------+------+
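With the corrupt column populated, the malformed rows can then be isolated for inspection. A minimal sketch (not part of the original report), reusing the reader and file from the example above:

val df = reader.load(csvFile.getAbsolutePath)
// Rows whose "corrupt" column is non-null are the records flagged as malformed
val corruptRows = df.filter(df("corrupt").isNotNull)
corruptRows.show(truncate = false)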
However, removing the .schema(...) call and using:
val reader = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "corrupt")
reader.load(csvFile.getAbsolutePath).show(truncate = false)
Yields:
+-------+-------+-------+
| FieldA| FieldB| FieldC|
+-------+-------+-------+
| a1    |b1     |c1     |
| a2    |b2     |c2     |
+-------+-------+-------+
The fourth value "d*" in the second row has been silently dropped and the row has not been marked as corrupt.
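Until this is addressed, a possible workaround (a sketch, not from the original report) is to infer the data schema in a first pass and then prepend the corrupt-record column before re-reading, so the full schema does not have to be written out by hand:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// First pass: infer only the data columns
val inferred = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(csvFile.getAbsolutePath)
  .schema

// Prepend the corrupt-record column (name matches the option above)
val withCorrupt = StructType(StructField("corrupt", StringType, nullable = true) +: inferred.fields)

// Second pass: read with the explicit schema so malformed rows are flagged
val flagged = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "corrupt")
  .schema(withCorrupt)
  .load(csvFile.getAbsolutePath)
flagged.show(truncate = false)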
Attachments
Issue Links
- duplicates: SPARK-28058 Reading csv with DROPMALFORMED sometimes doesn't drop malformed records (Resolved)