SPARK-45424

Regression in CSV schema inference when timestamps do not match specified timestampFormat

Details

    Description

      Spark 3.5.0 has a regression in CSV schema inference: a column is inferred as timestamp even when its contents do not match the specified timestampFormat.

      Test Data

      I have the following CSV file:

      2884-06-24T02:45:51.138
      2884-06-24T02:45:51.138
      2884-06-24T02:45:51.138
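
To reproduce the test data without creating the file by hand, something like the following may help. It is a minimal sketch using only the JDK file API; the object name `SampleCsv` and its `write` helper are hypothetical and not part of the original report.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}

// Hypothetical helper (not from the report) that recreates the test file.
object SampleCsv {
  // The three identical rows shown in the report.
  val rows: Seq[String] = Seq.fill(3)("2884-06-24T02:45:51.138")

  // Writes the rows, one per line, to `path` and returns the path written.
  def write(path: String): Path = {
    val content = rows.mkString("", "\n", "\n")
    Files.write(Paths.get(path), content.getBytes(StandardCharsets.UTF_8))
  }
}
```

Calling `SampleCsv.write("/tmp/timestamps.csv")` should produce the file at the path used in the commands that follow.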
      

      Spark 3.4.0 Behavior (correct)

      In Spark 3.4.0, if I specify the correct timestamp format, then the schema is inferred as timestamp:

      scala> val df = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSS").option("inferSchema", true).csv("/tmp/timestamps.csv")
      df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
      

      If I specify an incompatible timestampFormat (here, one that omits the fractional-seconds part .SSS), then the schema is inferred as string:

      scala> val df = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("inferSchema", true).csv("/tmp/timestamps.csv")
      df: org.apache.spark.sql.DataFrame = [_c0: string]
      

      Spark 3.5.0 Behavior (regression)

      In Spark 3.5.0, the column is inferred as timestamp even though the data does not match the specified timestampFormat:

      scala> val df = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("inferSchema", true).csv("/tmp/timestamps.csv")
      df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
      

      Reading the DataFrame then fails, because the parser stops at index 19, where the unmatched fractional seconds (.138) begin:

      Caused by: java.time.format.DateTimeParseException: Text '2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 19
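
The mismatch itself can be checked outside Spark with plain `java.time`. The sketch below is an assumption-laden illustration, not Spark's actual inference code: it uses `DateTimeFormatter.ofPattern` directly, whereas Spark wraps its own formatter around these patterns. The object name `TimestampFormatCheck` and its two helpers are hypothetical.

```scala
import java.time.LocalDateTime
import java.time.format.{DateTimeFormatter, DateTimeParseException}

// Hypothetical helper (not Spark code) for checking a value against a pattern.
object TimestampFormatCheck {
  // Sample value taken from the report.
  val sample = "2884-06-24T02:45:51.138"

  // Returns None when the whole string parses under `pattern`,
  // otherwise the index at which java.time reports the parse error.
  def errorIndex(value: String, pattern: String): Option[Int] =
    try {
      LocalDateTime.parse(value, DateTimeFormatter.ofPattern(pattern))
      None
    } catch {
      case e: DateTimeParseException => Some(e.getErrorIndex)
    }

  // True only when the entire value matches the pattern.
  def matches(value: String, pattern: String): Boolean =
    errorIndex(value, pattern).isEmpty
}
```

Under these assumptions, `matches(sample, "yyyy-MM-dd'T'HH:mm:ss.SSS")` succeeds, while the shorter pattern without `.SSS` leaves `.138` unparsed. A strict whole-string check of this kind is what the 3.4.0 behavior appears to reflect.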
      

            People

              fanjia Jia Fan
              andygrove Andy Grove