  1. Spark
  2. SPARK-39279 Fasten the schema inference of CSV/JSON data source
  3. SPARK-39193

Fasten Timestamp type inference of default format in JSON/CSV data source


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3.0
    • Fix Version/s: 3.3.0
    • Component/s: SQL
    • Labels: None

    Description

      When reading JSON/CSV files with timestamp type inference enabled (`.option("inferTimestamp", true)`), the timestamp conversion throws and catches exceptions. Because these exceptions carry detailed error messages, creating them is not cheap: it consumes more than 90% of the type inference time.

      We can use the parsing methods which return optional results instead.

      Before the change, inferring the schema of a 624 MB JSON file with timestamp inference enabled takes 166 seconds.

      After the change, it takes only 16 seconds.
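
The contrast between exception-based and optional-returning parsing can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Spark's actual implementation: the class `TimestampParseSketch` and the methods `isTimestampThrowing` and `parseOptional` are hypothetical names, and `DateTimeFormatter.ISO_LOCAL_DATE_TIME` is assumed as a stand-in for the default timestamp format. It relies on `DateTimeFormatter.parseUnresolved`, which signals malformed input by returning `null` rather than by throwing.

```java
import java.text.ParsePosition;
import java.time.DateTimeException;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoField;
import java.time.temporal.TemporalAccessor;
import java.util.Optional;

public class TimestampParseSketch {
    // Assumption: an ISO-like format stands in for the data source's default format.
    private static final DateTimeFormatter FORMATTER = DateTimeFormatter.ISO_LOCAL_DATE_TIME;

    // Exception-based check: every non-timestamp value constructs a
    // DateTimeParseException (with its message) just to be discarded.
    static boolean isTimestampThrowing(String s) {
        try {
            LocalDateTime.parse(s, FORMATTER);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    // Optional-based check: malformed text yields null from parseUnresolved,
    // so the common "not a timestamp" case builds no exception.
    static Optional<LocalDateTime> parseOptional(String s) {
        ParsePosition pos = new ParsePosition(0);
        TemporalAccessor parsed = FORMATTER.parseUnresolved(s, pos);
        // Reject parse failures and inputs with trailing characters.
        if (parsed == null || pos.getIndex() < s.length()) {
            return Optional.empty();
        }
        try {
            return Optional.of(LocalDateTime.of(
                field(parsed, ChronoField.YEAR, 0),
                field(parsed, ChronoField.MONTH_OF_YEAR, 1),
                field(parsed, ChronoField.DAY_OF_MONTH, 1),
                field(parsed, ChronoField.HOUR_OF_DAY, 0),
                field(parsed, ChronoField.MINUTE_OF_HOUR, 0),
                field(parsed, ChronoField.SECOND_OF_MINUTE, 0),
                field(parsed, ChronoField.NANO_OF_SECOND, 0)));
        } catch (DateTimeException e) {
            // Rare case: well-formed text with out-of-range values, e.g. February 30.
            return Optional.empty();
        }
    }

    // Reads an unresolved field, falling back to a default when the text omitted it.
    private static int field(TemporalAccessor parsed, ChronoField f, int dflt) {
        return parsed.isSupported(f) ? f.checkValidIntValue(parsed.getLong(f)) : dflt;
    }

    public static void main(String[] args) {
        System.out.println(parseOptional("2022-05-16T10:15:30"));
        System.out.println(parseOptional("not a timestamp"));
    }
}
```

During schema inference most candidate values in a non-timestamp column fail to parse, so the failure path is the hot path; returning an empty `Optional` there avoids building an exception per value.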

          People

            Assignee: Gengliang Wang
            Reporter: Gengliang Wang
            Votes: 0
            Watchers: 2
