Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Duplicate
- Affects Version/s: 2.0.0
- Fix Version/s: None
- Component/s: None
Description
Currently, there is no way to read CSV data without dropping whole rows when some of the data does not match the given schema.
There seem to be use cases with input like the following CSV file:

a,b
1,c
Here, the value a in the first column can be dirty data in real use cases.
But the code below:

val path = "/tmp/test.csv"
val schema = StructType(
  StructField("a", IntegerType, nullable = true) ::
  StructField("b", StringType, nullable = true) :: Nil)
val df = spark.read
  .format("csv")
  .option("mode", "PERMISSIVE")
  .schema(schema)
  .load(path)
df.show()
emits the exception below:
java.lang.NumberFormatException: For input string: "a"
  at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
  at java.lang.Integer.parseInt(Integer.java:580)
  at java.lang.Integer.parseInt(Integer.java:615)
  at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
  at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
  at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
With DROPMALFORMED and FAILFAST modes, the row is dropped or the read fails with an exception, respectively.
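A permissive per-field cast could, in principle, return null for an unparsable value instead of throwing, as the JSON source already does. A minimal plain-Scala sketch of that idea (the helper name permissiveToInt is hypothetical, not Spark API):

```scala
import scala.util.Try

// Hypothetical lenient cast: Some(int) on success, None on malformed input,
// mirroring how PERMISSIVE mode nullifies unparsable JSON fields.
def permissiveToInt(field: String): Option[Int] = Try(field.toInt).toOption

// Applied to the first-column values of the two CSV rows above:
// the dirty "a" becomes None (which Spark would render as null),
// while "1" parses normally.
val rows = Seq("a", "1").map(permissiveToInt)
// rows == Seq(None, Some(1))
```

A real fix would apply this null-on-failure semantics inside CSVTypeCast.castTo rather than in user code.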
FYI, this is not the case for JSON, because the JSON data source can handle this in PERMISSIVE mode, as below:
val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : \"a\"}"))
val schema = StructType(StructField("a", IntegerType, nullable = true) :: Nil)
spark.read.option("mode", "PERMISSIVE").schema(schema).json(rdd).show()
+----+
| a|
+----+
| 1|
|null|
+----+
Please refer to https://github.com/databricks/spark-csv/pull/298
Issue Links
- duplicates SPARK-18699: Spark CSV parsing types other than String throws exception when malformed (Resolved)