Spark / SPARK-40474

Correct CSV schema inference and data parsing behavior on columns with mixed dates and timestamps


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.4.0
    • Component/s: SQL
    • Labels: None

    Description

      In SPARK-39469 (https://issues.apache.org/jira/browse/SPARK-39469), we introduced support for the date type in CSV schema inference. Schema inference on date-time columns currently behaves as follows:

      • For a column only containing dates, we will infer it as Date type
      • For a column only containing timestamps, we will infer it as Timestamp type
      • For a column containing a mixture of dates and timestamps, we will infer it as Timestamp type
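
The three rules above can be sketched as a small toy model. This is not Spark's implementation: the function name `infer_column_type_before`, the two fixed patterns, and the use of Python's `strptime` directives instead of Spark's pattern syntax are all assumptions made for illustration.

```python
from datetime import datetime

# Assumed patterns for this sketch only (Python strptime syntax, not Spark's).
DATE_FMT = "%Y-%m-%d"
TS_FMT = "%Y-%m-%dT%H:%M:%S"


def _matches(value, fmt):
    """Return True if the whole string parses with the given pattern."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def infer_column_type_before(values):
    """Hypothetical model of the rules listed above: dates only -> DateType,
    timestamps only or a mixture -> TimestampType, anything else -> StringType."""
    kinds = set()
    for v in values:
        if _matches(v, DATE_FMT):
            kinds.add("date")
        elif _matches(v, TS_FMT):
            kinds.add("timestamp")
        else:
            return "StringType"  # a value that is neither a date nor a timestamp
    if not kinds:
        return "StringType"      # empty column: nothing to infer from
    return "DateType" if kinds == {"date"} else "TimestampType"
```

In this model, a column such as `["2022-09-01", "2022-09-01T10:00:00"]` falls into the third rule, which is the case the rest of the ticket revisits.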

      However, the last scenario proved too ambitious: supporting it added considerable complexity to the code and raised performance concerns. We therefore want to simplify and correct the behavior for the last scenario as follows:

      • For a column containing a mixture of dates and timestamps:
        • If the user specifies a timestamp format, the column is always inferred as `StringType`
        • If the user specifies no timestamp format, we try to infer the column as `TimestampType` where possible; otherwise it is inferred as `StringType`
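
A minimal sketch of the revised rule for mixed columns, under stated assumptions: `infer_mixed_column` is a hypothetical name, the `timestamp_format` argument uses Python `strptime` directives rather than Spark's pattern syntax, and the "no format given" path is modeled with `datetime.fromisoformat`, which also accepts date-only strings. It only illustrates why an explicit format pushes a mixed column to `StringType`; it does not reproduce Spark's parser.

```python
from datetime import datetime


def _parses_strict(value, fmt):
    """Strict check: the whole string must match the user-supplied pattern."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def _parses_lenient(value):
    """Lenient check (assumption): accept ISO dates and ISO timestamps alike."""
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False


def infer_mixed_column(values, timestamp_format=None):
    """Hypothetical model of the revised rule described above."""
    if not values:
        return "StringType"
    if timestamp_format is not None:
        # With an explicit format, date-only values do not match it,
        # so a mixed column ends up as StringType.
        ok = all(_parses_strict(v, timestamp_format) for v in values)
    else:
        # Without a format, try to read every value as a timestamp.
        ok = all(_parses_lenient(v) for v in values)
    return "TimestampType" if ok else "StringType"


mixed = ["2022-09-01", "2022-09-01T10:00:00"]
with_format = infer_mixed_column(mixed, "%Y-%m-%dT%H:%M:%S")  # "StringType"
without_format = infer_mixed_column(mixed)                    # "TimestampType"
```

The design point the sketch tries to capture is that the strict path never needs to reconcile two patterns per value, which is where the ticket locates the complexity and performance cost.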


          People

            Assignee: Xiaonan Yang (xiaonany94)
            Reporter: Xiaonan Yang (xiaonany94)
            Votes: 0
            Watchers: 4
