SPARK-16512

No way to load CSV data without dropping whole rows when some of the data does not match the given schema


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Currently, there is no way to read CSV data without dropping whole rows when some of the data does not match the given schema.

      It seems there are some use cases, as below:

      a,b
      1,c
      

      Here, the value a in the first column is dirty data that can occur in real use cases.

      But the code below:

      import org.apache.spark.sql.types._

      val path = "/tmp/test.csv"
      val schema = StructType(
        StructField("a", IntegerType, nullable = true) ::
        StructField("b", StringType, nullable = true) :: Nil)
      val df = spark.read
        .format("csv")
        .option("mode", "PERMISSIVE")
        .schema(schema)
        .load(path)
      df.show()
      

      emits the exception below:

      java.lang.NumberFormatException: For input string: "a"
      	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
      	at java.lang.Integer.parseInt(Integer.java:580)
      	at java.lang.Integer.parseInt(Integer.java:615)
      	at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
      	at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
      	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
      

      With DROPMALFORMED and FAILFAST modes, the row is dropped or the read fails with an exception, respectively.
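
      As a possible workaround (a sketch, not part of the original report), the problematic column can be read as a string and cast afterwards; Spark SQL's cast returns null for unparseable values instead of throwing:

      import org.apache.spark.sql.functions.col
      import org.apache.spark.sql.types._

      // Read both columns as strings, then cast "a" to int; the dirty value "a" becomes null.
      val rawSchema = StructType(
        StructField("a", StringType, nullable = true) ::
        StructField("b", StringType, nullable = true) :: Nil)
      val casted = spark.read
        .format("csv")
        .schema(rawSchema)
        .load("/tmp/test.csv")
        .select(col("a").cast(IntegerType).as("a"), col("b"))
      casted.show()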

      FYI, this is not the case for JSON, because the JSON data source can handle this with PERMISSIVE mode, as below:

      val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : \"a\"}"))
      val schema = StructType(StructField("a", IntegerType, nullable = true) :: Nil)
      spark.read.option("mode", "PERMISSIVE").schema(schema).json(rdd).show()
      
      +----+
      |   a|
      +----+
      |   1|
      |null|
      +----+
      

      Please refer to https://github.com/databricks/spark-csv/pull/298

    People

      Assignee: Unassigned
      Reporter: Hyukjin Kwon (gurwls223)
      Votes: 0
      Watchers: 2
