Spark / SPARK-17066

dateFormat should be used when writing dataframes as csv files


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: Input/Output
    • Labels: None

    Description

      I noticed this when running tests after pulling and building @lw-lin 's PR (https://github.com/apache/spark/pull/14118). I don't think anything is wrong with his PR; rather, the fix that was made to spark-csv for this issue was never ported to Spark 2.x when Databricks' spark-csv was merged into Spark 2 back in January. https://github.com/databricks/spark-csv/issues/308 was fixed in spark-csv after that merge.

      The problem is that if I try to write a dataframe that contains a date column out to a CSV file using something like this:

      repartitionDf.write.format("csv") //.format(DATABRICKS_CSV)
        .option("delimiter", "\t")
        .option("header", "false")
        .option("nullValue", "?")
        .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")
        .option("escape", "
        ")
        .save(tempFileName)

      then my unit test (which passed under Spark 1.6.2) fails using the Spark 2.1.0 snapshot build that I made today. The dataframe contained 3 values in a date column.

      Expected "[2012-01-03T09:12:00
      ?
      2015-02-23T18:00:]00"
      but got
      "[1325610720000000
      ?
      14247432000000]00"

      (The square brackets are the test framework's diff notation, marking the portion of each string that differs; "00" is the shared suffix.)

      This means that while the null value is being correctly exported, the specified dateFormat is not being used to format the date. Instead, it looks like the raw internal value (microseconds since the epoch) is being written out.
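      To confirm that the raw values in the failing output really are the same instants as the expected strings, here is a plain-Java sketch (outside Spark) that formats the two epoch values with the same pattern passed to the dateFormat option. The UTC-8 time zone is an assumption on my part; it is what makes the reporter's expected strings line up, so the tests presumably ran in a Pacific-time environment.

      ```java
      import java.text.SimpleDateFormat;
      import java.util.Date;
      import java.util.TimeZone;

      public class EpochCheck {
          public static void main(String[] args) {
              // Same pattern as the failing .option("dateFormat", ...) call.
              SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
              // Assumption: reporter's environment was UTC-8 (Pacific Standard Time).
              fmt.setTimeZone(TimeZone.getTimeZone("America/Los_Angeles"));

              // The raw values in the "but got" output, read as microseconds
              // since the epoch (divide by 1000 for the milliseconds that
              // java.util.Date expects).
              long[] micros = {1325610720000000L, 1424743200000000L};
              for (long us : micros) {
                  System.out.println(fmt.format(new Date(us / 1000L)));
              }
              // Prints 2012-01-03T09:12:00 and 2015-02-23T18:00:00,
              // matching the expected strings above.
          }
      }
      ```

      In other words, the writer is emitting the column's internal microsecond representation verbatim instead of applying the configured format.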

            People

              Assignee: Unassigned
              Reporter: Barry Becker (barrybecker4)
              Votes: 0
              Watchers: 1