[SPARK-40474] Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.4.0
Fix Version/s: 3.4.0
Component/s: SQL
Labels: None

Description

In SPARK-39469 (https://issues.apache.org/jira/browse/SPARK-39469), we introduced support for the date type in CSV schema inference. The current schema inference behavior on date/time columns is:

- A column containing only dates is inferred as `DateType`
- A column containing only timestamps is inferred as `TimestampType`
- A column containing a mixture of dates and timestamps is inferred as `TimestampType`

However, the last scenario proved too ambitious: supporting it introduced considerable code complexity and raised performance concerns. We therefore want to simplify and correct the behavior for that scenario as follows:

For a column containing a mixture of dates and timestamps:
- If the user specifies a timestamp format, the column is always inferred as `StringType`
- If the user does not specify a timestamp format, we try to infer the column as `TimestampType` if possible; otherwise it is inferred as `StringType`
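
The proposed rule can be illustrated with a small, self-contained Python sketch. This is only a toy model of the decision logic described above, not Spark's implementation: the function name `infer_column_type`, the pattern constants, and the use of `strptime` patterns are all assumptions made for illustration, and Spark's real default patterns and parsers differ.

```python
from datetime import datetime

# Hypothetical patterns used only by this sketch; not Spark's defaults.
DATE_FORMAT = "%Y-%m-%d"
DEFAULT_TIMESTAMP_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S")


def _matches(value, fmt):
    """Return True if `value` parses completely under the strptime pattern `fmt`."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def infer_column_type(values, timestamp_format=None):
    """Toy model of the proposed inference rule for one CSV column of strings."""
    if not values:
        return "StringType"

    is_date = [_matches(v, DATE_FORMAT) for v in values]
    if all(is_date):
        return "DateType"

    if timestamp_format is not None:
        # User-specified format: every value must match it exactly, so a
        # column mixing dates and timestamps falls back to StringType.
        if all(_matches(v, timestamp_format) for v in values):
            return "TimestampType"
        return "StringType"

    # No user-specified format: be lenient and accept date-only values as
    # timestamps, so a mixed column can still be inferred as TimestampType.
    def is_lenient_timestamp(value, value_is_date):
        return value_is_date or any(
            _matches(value, fmt) for fmt in DEFAULT_TIMESTAMP_FORMATS
        )

    if all(is_lenient_timestamp(v, d) for v, d in zip(values, is_date)):
        return "TimestampType"
    return "StringType"


mixed = ["2022-01-01", "2022-01-02 10:00:00"]
print(infer_column_type(mixed))
print(infer_column_type(mixed, timestamp_format="%Y-%m-%d %H:%M:%S"))
```

In this sketch, the mixed column is inferred as `TimestampType` when no format is given, and as `StringType` when a timestamp format is supplied, because the date-only value no longer matches the format.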