Description
In Spark 3.x, when reading CSV data like this:
name,mydate
1,2020011
2,20201203
and specifying the date pattern "yyyyMMdd", dates are not parsed correctly under the CORRECTED time parser policy.
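For reference, this policy comes from the spark.sql.legacy.timeParserPolicy session config; a minimal sketch of pinning it explicitly:

// CORRECTED (as opposed to LEGACY) selects the java.time-based parsers
// rather than the old SimpleDateFormat ones.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")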
For example,
val df = spark.read.schema("name string, mydate date").option("dateFormat", "yyyyMMdd").option("header", "true").csv("file:/tmp/test.csv") df.show(false)
Returns:
+----+--------------+
|name|mydate        |
+----+--------------+
|1   |+2020011-01-01|
|2   |2020-12-03    |
+----+--------------+
whereas Spark 3.2 and below used to return null instead of the invalid date.
The issue appears to be caused by this PR: https://github.com/apache/spark/pull/32959.
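The output above is consistent with the pattern-based parse failing and a more permissive fallback taking over. A small java.time sketch; the year-only interpretation is an assumption inferred from the observed output, not taken from the Spark source:

import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

// Strict parsing with the user's pattern fails: "yyyyMMdd" needs at least
// 4 digits for yyyy plus 2 + 2 for MM and dd, but "2020011" has only 7.
val strict = Try(LocalDate.parse("2020011", DateTimeFormatter.ofPattern("yyyyMMdd")))
println(strict.isFailure) // true

// If a fallback then treats the whole digit run as a bare year, the result
// renders exactly like the value shown above:
println(LocalDate.of(2020011, 1, 1)) // +2020011-01-01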
A similar issue can be observed in the JSON data source.
test.json
{"date": "2020011"} {"date": "20201203"}
Running the commands
val df = spark.read.schema("date date").option("dateFormat", "yyyyMMdd").json("file:/tmp/test.json") df.show(false)
returns
+--------------+
|date          |
+--------------+
|+2020011-01-01|
|2020-12-03    |
+--------------+
but before the patch linked in the description it used to show:
+----------+
|date      |
+----------+
|7500-08-09|
|2020-12-03|
+----------+
which is strange either way. I will try to address it in the PR.
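As a possible workaround on versions that include SPARK-40215 (linked below), the per-source option added there can disable the fallback, so the malformed value comes back as null under the default PERMISSIVE mode, matching the old behaviour. A sketch, assuming the option name from that ticket:

// Disable the lenient date parsing fallback (option from SPARK-40215).
val csvDf = spark.read
  .schema("name string, mydate date")
  .option("dateFormat", "yyyyMMdd")
  .option("header", "true")
  .option("enableDateTimeParsingFallback", "false")
  .csv("file:/tmp/test.csv")

// The same option should apply to the JSON reader as well.
val jsonDf = spark.read
  .schema("date date")
  .option("dateFormat", "yyyyMMdd")
  .option("enableDateTimeParsingFallback", "false")
  .json("file:/tmp/test.json")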
Issue Links
- is related to:
  - SPARK-40496 Configs to control "enableDateTimeParsingFallback" are incorrectly swapped (Resolved)
  - SPARK-40215 Add SQL configs to control CSV/JSON date and timestamp parsing behaviour (Resolved)