  1. Spark
  2. SPARK-18072

empty/null Timestamp field

    Details

    • Type: Question
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: Input/Output
    • Environment:

      hadoop 2.7.1, ubuntu 15.10, databricks 1.5, spark-csv 1.5.0, scala 2.11.8

      Description

      Hossein Falaki asked me to create a JIRA here. The problem was previously reported against databricks/spark-csv on GitHub: https://github.com/databricks/spark-csv/issues/388#issuecomment-255631718

      I have a problem with Spark 2.0.0, spark-csv 1.5.0, and Scala 2.11.8.

      I have a CSV file that I want to convert to Parquet. One column contains timestamps, and some of them are missing. The missing values are empty strings (no quotes, not even a space; the line simply ends, because it is the last column). The following exception is thrown:

      16/10/23 02:46:08 ERROR Utils: Aborting task
      java.lang.IllegalArgumentException
      	at java.sql.Date.valueOf(Date.java:143)
      	at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
      	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:287)
      	at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:115)
      	at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:84)
      ...
      

      These are the options I use when reading the CSV:

      "delimiter" -> ","
      "header" -> "true"
      "inferSchema" -> "true"
      "treatEmptyValuesAsNulls" -> "true"
      "nullValue" -> ""
      

      Execution goes through CSVInferSchema.scala (lines 284-287) in *spark-sql_2.11-2.0.0-sources.jar*:

            case _: TimestampType =>
              // This one will lose microseconds parts.
              // See https://issues.apache.org/jira/browse/SPARK-10681.
              DateTimeUtils.stringToTime(datum).getTime  * 1000L
      

      This invokes `Date.valueOf(s)` in DateTimeUtils.scala (spark-catalyst_2.11-2.0.0-sources.jar), which then throws the exception from java.sql.Date.valueOf.
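
      The last step of that call chain can be sketched outside Spark, using only the JDK: `java.sql.Date.valueOf` rejects an empty string with an `IllegalArgumentException`, which matches the top frame of the stack trace. The object name `EmptyDateRepro` and the helper `parseDate` are hypothetical, introduced only for illustration. They are not part of Spark or spark-csv, and this is not how Spark itself handles the value.

```scala
import java.sql.Date
import scala.util.Try

// Hypothetical helper object, for illustration only (not Spark code).
object EmptyDateRepro {
  // Date.valueOf expects yyyy-[m]m-[d]d and throws IllegalArgumentException
  // for anything else, including the empty string. Wrapping the call in Try
  // turns that failure into None instead of aborting the caller.
  def parseDate(s: String): Option[Date] =
    Try(Date.valueOf(s)).toOption

  def main(args: Array[String]): Unit = {
    // A well-formed date parses successfully.
    println(parseDate("2016-10-23"))
    // The empty last column from the CSV does not.
    println(parseDate(""))
  }
}
```

      Calling `Date.valueOf("")` directly, without the `Try`, raises the same exception type that is reported in the task log above.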

      Is this a bug, am I doing something wrong, or is there a way to pass a default value?

            People

            • Assignee:
              Unassigned
              Reporter:
              mpekalski marcin pekalski