Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 3.5.0
Description
There is a regression in Spark 3.5.0 when inferring the schema of CSV files containing timestamps, where a column will be inferred as a timestamp even if the contents do not match the specified timestampFormat.
Test Data
I have the following CSV file:
2884-06-24T02:45:51.138
2884-06-24T02:45:51.138
2884-06-24T02:45:51.138
Spark 3.4.0 Behavior (correct)
In Spark 3.4.0, if I specify the correct timestamp format, then the schema is inferred as timestamp:
scala> val df = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSS").option("inferSchema", true).csv("/tmp/timestamps.csv")
df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
If I specify an incompatible timestampFormat, then the schema is inferred as string:
scala> val df = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("inferSchema", true).csv("/tmp/timestamps.csv")
df: org.apache.spark.sql.DataFrame = [_c0: string]
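The expected rule described above can be modelled without Spark. The following is a minimal sketch, assuming that a column should be inferred as timestamp only when every value parses completely under the given timestampFormat. The object and method names (InferenceSketch, matchesFormat, inferColumnType) are hypothetical and are not Spark APIs; Spark's actual inference logic is more involved.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper, not part of Spark: a simplified model of the expected rule.
object InferenceSketch {
  // A value matches only if the formatter consumes the entire string.
  def matchesFormat(value: String, fmt: DateTimeFormatter): Boolean =
    Try(LocalDateTime.parse(value, fmt)).isSuccess

  // "timestamp" only when every value matches; otherwise fall back to "string".
  def inferColumnType(values: Seq[String], timestampFormat: String): String = {
    val fmt = DateTimeFormatter.ofPattern(timestampFormat)
    if (values.nonEmpty && values.forall(matchesFormat(_, fmt))) "timestamp" else "string"
  }
}
```

Under this model, the three test rows yield "timestamp" for the format with the .SSS suffix and "string" for the format without it, which matches the Spark 3.4.0 results shown above.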
Spark 3.5.0 Behavior (incorrect)
In Spark 3.5.0, the column will be inferred as timestamp even if the data does not match the specified timestampFormat.
scala> val df = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("inferSchema", true).csv("/tmp/timestamps.csv")
df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
Reading the DataFrame then results in an error:
Caused by: java.time.format.DateTimeParseException: Text '2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 19
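The "index 19" in the message is the position of the unparsed ".138" suffix. A small sketch using only java.time (no Spark) appears to reproduce the same failure position; ParseErrorRepro and errorIndex are hypothetical names introduced for illustration, and the sketch assumes the parse goes through a plain DateTimeFormatter built from the pattern.

```scala
import java.time.LocalDateTime
import java.time.format.{DateTimeFormatter, DateTimeParseException}

// Hypothetical helper, not part of Spark: reports where a full parse stops.
object ParseErrorRepro {
  val sample = "2884-06-24T02:45:51.138"

  // Returns None if the whole text parses, or Some(index) where parsing failed.
  def errorIndex(pattern: String, text: String): Option[Int] =
    try {
      LocalDateTime.parse(text, DateTimeFormatter.ofPattern(pattern))
      None
    } catch {
      case e: DateTimeParseException => Some(e.getErrorIndex)
    }
}
```

Calling ParseErrorRepro.errorIndex("yyyy-MM-dd'T'HH:mm:ss", ParseErrorRepro.sample) returns Some(19), because the pattern consumes the first 19 characters and leaves the fractional seconds unparsed. With the pattern "yyyy-MM-dd'T'HH:mm:ss.SSS" the result is None.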