Spark / SPARK-17066

dateFormat should be used when writing dataframes as csv files


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: Input/Output
    • Labels: None

    Description

      I noticed this when running tests after pulling and building @lw-lin 's PR (https://github.com/apache/spark/pull/14118). I don't think anything is wrong with his PR; rather, the fix that was made to spark-csv for this issue was never ported to Spark 2.x when Databricks' spark-csv was merged into Spark 2 back in January. https://github.com/databricks/spark-csv/issues/308 was fixed in spark-csv after that merge.

      The problem is that if I try to write a dataframe that contains a date column out to a CSV file using something like this:

      repartitionDf.write.format("csv") //.format(DATABRICKS_CSV)
        .option("delimiter", "\t")
        .option("header", "false")
        .option("nullValue", "?")
        .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")
        .option("escape", "
        ")
        .save(tempFileName)

      then my unit test (which passed under Spark 1.6.2) fails using the Spark 2.1.0 snapshot build that I made today. The dataframe contained 3 values in a date column.

      Expected "[2012-01-03T09:12:00
      ?
      2015-02-23T18:00:]00"
      but got
      "[1325610720000000
      ?
      14247432000000]00"

      (The square brackets are the test framework's diff notation, marking the portion of each string that differs; "00" is the shared suffix.)

      This means that while the null value is being correctly exported, the specified dateFormat is not being used to format the date. Instead, it looks like the raw internal value (microseconds since the epoch) is being written out.
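      To confirm that the raw values in the failing output really are the same instants as the expected strings, here is a plain-Java sketch (outside Spark) that formats the two epoch values with the same pattern passed to the dateFormat option. The UTC-8 time zone is an assumption on my part; it is what makes the reporter's expected strings line up, so the tests presumably ran in a Pacific-time environment.

      ```java
      import java.text.SimpleDateFormat;
      import java.util.Date;
      import java.util.TimeZone;

      public class EpochCheck {
          public static void main(String[] args) {
              // Same pattern as the failing .option("dateFormat", ...) call.
              SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
              // Assumption: reporter's environment was UTC-8 (Pacific Standard Time).
              fmt.setTimeZone(TimeZone.getTimeZone("America/Los_Angeles"));

              // The raw values in the "but got" output, read as microseconds
              // since the epoch (divide by 1000 for the milliseconds that
              // java.util.Date expects).
              long[] micros = {1325610720000000L, 1424743200000000L};
              for (long us : micros) {
                  System.out.println(fmt.format(new Date(us / 1000L)));
              }
              // Prints 2012-01-03T09:12:00 and 2015-02-23T18:00:00,
              // matching the expected strings above.
          }
      }
      ```

      In other words, the writer is emitting the column's internal microsecond representation verbatim instead of applying the configured format.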

            People

              Assignee: Unassigned
              Reporter: Barry Becker (barrybecker4)
              Votes: 0
              Watchers: 1