SPARK-16460: Spark 2.0 CSV ignores NULL value in Date format


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: SQL
    • Labels: None
    • Environment: SparkR

    Description

      Trying to read a CSV file into Spark (using SparkR) that contains just this data row:

          1|1998-01-01||
      

      Using Spark 1.6.2 (Hadoop 2.6) gives me:

          > head(sdf)
            id          d dtwo
          1  1 1998-01-01   NA
      

      The Spark 2.0 preview (Hadoop 2.7, Rev. 14308) fails with the following error:

          > Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
          org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.text.ParseException: Unparseable date: ""
          at java.text.DateFormat.parse(DateFormat.java:357)
          at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:289)
          at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:98)
          at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:74)
          at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
          at org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
          at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
          at scala.collection.Iterator$$anon$12.hasNext(Itera...

      The problem does indeed seem to be the NULL value: with a valid date in the third CSV column, the same file reads fine.
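
      For illustration, here is a minimal standalone Scala sketch of the failure mode (this is not the actual Spark source; per the stack trace above, the real cast happens in CSVTypeCast.castTo). An unguarded cast hands the empty field straight to SimpleDateFormat, which throws; checking the field against the configured nullValue first yields NULL instead:

          import java.text.SimpleDateFormat

          // Sketch only: castToDate stands in for the date branch of the CSV type cast.
          // nullValue and dateFormat mirror the reader options used in the repro below.
          def castToDate(datum: String, nullValue: String, dateFormat: String): java.sql.Date = {
            // The guard Spark 2.0.0 appears to be missing for non-string types:
            // treat a field equal to nullValue as SQL NULL instead of parsing it.
            if (datum == nullValue) {
              null
            } else {
              new java.sql.Date(new SimpleDateFormat(dateFormat).parse(datum).getTime)
            }
          }

          castToDate("", "", "yyyy-MM-dd")           // null, no ParseException
          castToDate("1998-01-01", "", "yyyy-MM-dd") // 1998-01-01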

      R code:

          #Sys.setenv(SPARK_HOME = 'c:/spark/spark-1.6.2-bin-hadoop2.6') 
          Sys.setenv(SPARK_HOME = 'C:/spark/spark-2.0.0-preview-bin-hadoop2.7')
          .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
          library(SparkR)
          
          sc <-
              sparkR.init(
                  master = "local",
                  sparkPackages = "com.databricks:spark-csv_2.11:1.4.0"
              )
          sqlContext <- sparkRSQL.init(sc)
          
          
          # schema: integer id plus two date columns; dtwo is empty ("") in the test row
          st <- structType(structField("id", "integer"), structField("d", "date"), structField("dtwo", "date"))
          
          # nullValue = "" asks the reader to treat empty fields as NULL
          sdf <- read.df(
              sqlContext,
              path = "d:/date_test.csv",
              source = "com.databricks.spark.csv",
              schema = st,
              inferSchema = "false",
              delimiter = "|",
              dateFormat = "yyyy-MM-dd",
              nullValue = "",
              mode = "PERMISSIVE"
          )
          
          head(sdf)
          
          sparkR.stop()
      
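      A possible workaround until a fix ships is to declare the nullable date column as a string in the schema and cast it after parsing, so the empty field never reaches the date parser. Below is a sketch of that idea against the Spark 2.0 Scala API (the same schema change works from read.df in SparkR; column names and the file path are taken from the repro above):

          import org.apache.spark.sql.SparkSession
          import org.apache.spark.sql.functions.col
          import org.apache.spark.sql.types._

          val spark = SparkSession.builder().master("local").getOrCreate()

          val schema = StructType(Seq(
            StructField("id", IntegerType),
            StructField("d", DateType),
            StructField("dtwo", StringType)  // read as string first, cast to date below
          ))

          val sdf = spark.read
            .schema(schema)
            .option("delimiter", "|")
            .option("dateFormat", "yyyy-MM-dd")
            .option("nullValue", "")
            .csv("d:/date_test.csv")
            // "" was already turned into NULL by nullValue, and cast returns NULL
            // for unparseable strings, so this step cannot throw the ParseException above.
            .withColumn("dtwo", col("dtwo").cast(DateType))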

            People

              Assignee: Liwei Lin (proflin)
              Reporter: Marcel Boldt (marcelboldt)
              Votes: 3
              Watchers: 4
