Spark / SPARK-24969

SQL: to_date function can't parse date strings in different locales.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.2.1
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
    • Environment: Bare Spark 2.2.1 installation on RHEL 6.

      Description

      The to_date SQL function internally uses org.apache.spark.sql.catalyst.util.DateTimeUtils, where the locale is hardcoded to Locale.US.

      As a result, to_date cannot parse a dataset whose dates use month names in another language (Italian in this case).

      spark.read.format("csv")
          .option("sep", ";")
          .csv(logFile)
          .toDF("DATA", .....)
          .withColumn("DATA2", to_date(col("DATA"), "yyyy MMM"))
          .show(10)

      Results from example dataset:

      DATA      DATA2
      2018 giu  null
      2018 mag  null
      2018 apr  2018-04-01
      2018 mar  2018-03-01
      2018 feb  2018-02-01
      2018 gen  null
      2017 dic  null
      2017 nov  2017-11-01
      2017 ott  null
      2017 set  null

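      Expected results: all values are converted. In the output above, only the months whose Italian abbreviation matches the English one (feb, mar, apr, nov) are parsed; every other row yields null.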
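      The behaviour can be reproduced outside Spark. The following is a minimal sketch in plain Java, assuming that to_date ends up parsing with a java.text.SimpleDateFormat built with Locale.US (an assumption about Spark 2.2.1 internals, not something this report confirms). The class LocaleDateParsing and the method parseYearMonth are hypothetical names used only for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

final class LocaleDateParsing {

    /**
     * Parses text with the pattern "yyyy MMM" under the given locale.
     * Returns the date as "yyyy-MM-dd", or null when parsing fails
     * (mirroring the null values shown in the report).
     */
    static String parseYearMonth(String text, Locale locale) {
        // Month names are looked up in the locale passed here.
        SimpleDateFormat parser = new SimpleDateFormat("yyyy MMM", locale);
        try {
            Date parsed = parser.parse(text);
            // Format in the same default time zone that was used for parsing.
            return new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT).format(parsed);
        } catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String[] samples = {"2018 giu", "2018 apr", "2017 dic"};
        for (String sample : samples) {
            System.out.println(sample
                    + " | Locale.US: " + parseYearMonth(sample, Locale.US)
                    + " | Locale.ITALY: " + parseYearMonth(sample, Locale.ITALY));
        }
    }
}
```

      With Locale.US, "2018 giu" is not recognised as a month name and the method returns null, while "2018 apr" parses because the abbreviation is the same in both languages. Passing Locale.ITALY is expected to make the Italian abbreviations parse, provided the JDK's locale data uses the same abbreviations as the dataset.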

      TEMPORARY WORKAROUND:

      In the object org.apache.spark.sql.catalyst.util.DateTimeUtils, replace every occurrence of Locale.US with the locale of your data (for example, Locale.ITALY), then rebuild Spark.
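      A possible alternative that avoids patching Spark is to parse the column with an explicit locale in user code. The sketch below is not part of the original report: ItalianYearMonthParser is a hypothetical helper built only on java.time, and it assumes the input always has the form "yyyy MMM" with Italian month abbreviations. Wrapping it in a Spark UDF is left out here and would need to be checked against the Spark version in use.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoField;
import java.util.Locale;

final class ItalianYearMonthParser {

    // Formatter for strings such as "2018 giu". The locale is chosen
    // explicitly instead of relying on a hardcoded default.
    private static final DateTimeFormatter FORMAT = new DateTimeFormatterBuilder()
            .parseCaseInsensitive()
            .appendPattern("uuuu MMM")
            // The input carries no day, so default to the first of the month.
            .parseDefaulting(ChronoField.DAY_OF_MONTH, 1)
            .toFormatter(Locale.ITALY);

    /** Returns the parsed date, or null when the text is null or does not match. */
    static LocalDate parse(String text) {
        if (text == null) {
            return null;
        }
        try {
            return LocalDate.parse(text.trim(), FORMAT);
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("2018 giu"));
        System.out.println(parse("2017 dic"));
    }
}
```

      Returning null on failure keeps the behaviour close to what to_date does for unparseable rows, so downstream null handling would not need to change.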

      ADDITIONAL NOTES:

      I can make a pull request available on GitHub.

            People

            • Assignee: Unassigned
            • Reporter: Valentino Pinna
            • Votes: 0
            • Watchers: 3
