[SPARK-16460] Spark 2.0 CSV ignores NULL value in Date format


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: SQL
    • Labels: None
    • Environment: SparkR

    Description

      Trying to read a CSV file into Spark (using SparkR) containing just this data row:

          1|1998-01-01||
      

      Using Spark 1.6.2 (Hadoop 2.6) gives me

          > head(sdf)
            id          d dtwo
          1  1 1998-01-01   NA
      

      Spark 2.0 preview (Hadoop 2.7, Rev. 14308) fails with an error:

      > Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
      org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.text.ParseException: Unparseable date: ""
      at java.text.DateFormat.parse(DateFormat.java:357)
      at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:289)
      at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:98)
      at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:74)
      at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
      at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
      at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
      at scala.collection.Iterator$$anon$12.hasNext(Itera...
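The `Unparseable date: ""` at the top of the trace is the standard behavior of a strict date parser handed an empty string; for illustration, the same thing reproduced in plain Python (analogous to, not the same as, `java.text.DateFormat.parse`):

```python
from datetime import datetime

# A strict date parser rejects the empty string outright, analogous to
# java.text.DateFormat.parse("") throwing ParseException in the trace above.
try:
    datetime.strptime("", "%Y-%m-%d")
except ValueError as e:
    print(f"parse failed: {e}")
```

So any null-value substitution has to happen before the field ever reaches the date parser.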

      The problem does indeed seem to be the NULL value: with a valid date in the third CSV column, the same code works.

      R code:

          #Sys.setenv(SPARK_HOME = 'c:/spark/spark-1.6.2-bin-hadoop2.6') 
          Sys.setenv(SPARK_HOME = 'C:/spark/spark-2.0.0-preview-bin-hadoop2.7')
          .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
          library(SparkR)
          
          sc <-
              sparkR.init(
                  master = "local",
                  sparkPackages = "com.databricks:spark-csv_2.11:1.4.0"
              )
          sqlContext <- sparkRSQL.init(sc)
          
          
          st <- structType(structField("id", "integer"), structField("d", "date"), structField("dtwo", "date"))
          
          sdf <- read.df(
              sqlContext,
              path = "d:/date_test.csv",
              source = "com.databricks.spark.csv",
              schema = st,
              inferSchema = "false",
              delimiter = "|",
              dateFormat = "yyyy-MM-dd",
              nullValue = "",
              mode = "PERMISSIVE"
          )
          
          head(sdf)
          
          sparkR.stop()
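The stack trace suggests the caster hands the raw field straight to the date parser without first comparing it to `nullValue`. A minimal sketch of the expected null-aware behavior, in plain Python with hypothetical names (Spark's actual logic lives in Scala's `CSVTypeCast.castTo`):

```python
from datetime import date, datetime

def cast_to_date(datum, date_format="%Y-%m-%d", null_value=""):
    """Null-aware date cast: compare against null_value *before* parsing.

    Hypothetical sketch of the behavior the repro expects; not Spark's
    actual CSVTypeCast API.
    """
    if datum == null_value:
        return None  # Spark 2.0 skipped this check and parsed "" directly
    return datetime.strptime(datum, date_format).date()

# The single data row from the report, split on the "|" delimiter
fields = "1|1998-01-01||".split("|")
row = (int(fields[0]), cast_to_date(fields[1]), cast_to_date(fields[2]))
print(row)  # (1, datetime.date(1998, 1, 1), None)
```

With the null check in place, the row comes back as `id = 1`, `d = 1998-01-01`, `dtwo = NULL`, matching the Spark 1.6.2 output shown above.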
      


            People

              Assignee: Liwei Lin (proflin, inactive)
              Reporter: Marcel Boldt (marcelboldt)
              Votes: 3
              Watchers: 6
