Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Fix Version: 2.0.0
- Labels: None
- Component: SparkR
Description
Trying to read a CSV file into Spark (using SparkR) containing just this data row:
1|1998-01-01||
Using Spark 1.6.2 (Hadoop 2.6) gives me:
> head(sdf)
  id          d dtwo
1  1 1998-01-01   NA
The Spark 2.0 preview (Hadoop 2.7, Rev. 14308) fails with an error:
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.text.ParseException: Unparseable date: ""
at java.text.DateFormat.parse(DateFormat.java:357)
at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:289)
at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:98)
at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:74)
at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Itera...
The problem does indeed seem to be the NULL value here: with a valid date in the third CSV column it works.
R code:
# Sys.setenv(SPARK_HOME = 'c:/spark/spark-1.6.2-bin-hadoop2.6')
Sys.setenv(SPARK_HOME = 'C:/spark/spark-2.0.0-preview-bin-hadoop2.7')
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

sc <- sparkR.init(
  master = "local",
  sparkPackages = "com.databricks:spark-csv_2.11:1.4.0"
)
sqlContext <- sparkRSQL.init(sc)

st <- structType(
  structField("id", "integer"),
  structField("d", "date"),
  structField("dtwo", "date")
)

sdf <- read.df(
  sqlContext,
  path = "d:/date_test.csv",
  source = "com.databricks.spark.csv",
  schema = st,
  inferSchema = "false",
  delimiter = "|",
  dateFormat = "yyyy-MM-dd",
  nullValue = "",
  mode = "PERMISSIVE"
)

head(sdf)
sparkR.stop()
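For reference, the failure mode can be reproduced outside Spark: parsing an empty string with any date format always throws, so a CSV reader must compare the raw token against nullValue before attempting the date cast. A minimal sketch of that order of operations, in Python — the function name is illustrative, not Spark's actual code:

```python
from datetime import datetime, date


def cast_to_date(token, date_format="%Y-%m-%d", null_value=""):
    """Mimic a CSV type cast: check nullValue BEFORE parsing.

    Parsing "" directly raises ValueError, the analogue of the
    java.text.ParseException seen in the Spark 2.0 preview:
        datetime.strptime("", "%Y-%m-%d")  # raises ValueError
    """
    if token == null_value:
        # Null handled first, which is the behavior Spark 1.6.2
        # effectively exhibited (dtwo came back as NA, not an error).
        return None
    return datetime.strptime(token, date_format).date()


# The sample row "1|1998-01-01||" yields these tokens for d and dtwo:
print(cast_to_date("1998-01-01"))  # 1998-01-01
print(cast_to_date(""))            # None instead of an exception
```

Under this ordering the empty third column maps cleanly to NULL; casting first and checking nullValue afterwards reproduces the "Unparseable date" crash.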
Attachments
Issue Links
- is related to
  - SPARK-16981 For CSV files nullValue is not respected for Date/Time data type (Resolved)
- relates to
  - SPARK-16462 Spark 2.0 CSV does not cast null values to certain data types properly (Resolved)
- links to