Spark / SPARK-31414 (sub-task of SPARK-31408 Build Spark’s own datetime pattern definition)

Performance regression with new TimestampFormatter for json and csv

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: SQL
    • Labels: None

    Description

      With the original benchmark, where the timestamp values are valid for the new parser, the result is:

      [info] Running benchmark: Read dates and timestamps
      [info]   Running case: timestamp strings
      [info]   Stopped after 3 iterations, 5781 ms
      [info]   Running case: parse timestamps from Dataset[String]
      [info]   Stopped after 3 iterations, 44764 ms
      [info]   Running case: infer timestamps from Dataset[String]
      [info]   Stopped after 3 iterations, 93764 ms
      [info]   Running case: from_json(timestamp)
      [info]   Stopped after 3 iterations, 59021 ms
      

      Now modify the benchmark to:

            def timestampStr: Dataset[String] = {
              spark.range(0, rowsNum, 1, 1).mapPartitions { iter =>
                iter.map(i => s"""{"timestamp":"1970-01-01T01:02:03.${i % 100}"}""")
              }.select($"value".as("timestamp")).as[String]
            }
      
            readBench.addCase("timestamp strings", numIters) { _ =>
              timestampStr.noop()
            }
      
            readBench.addCase("parse timestamps from Dataset[String]", numIters) { _ =>
              spark.read.schema(tsSchema).json(timestampStr).noop()
            }
      
            readBench.addCase("infer timestamps from Dataset[String]", numIters) { _ =>
              spark.read.json(timestampStr).noop()
            }
      

      Here the timestamp values are invalid for the new parser, which causes a fallback to the legacy parser. The result is:

      [info] Running benchmark: Read dates and timestamps
      [info]   Running case: timestamp strings
      [info]   Stopped after 3 iterations, 5623 ms
      [info]   Running case: parse timestamps from Dataset[String]
      [info]   Stopped after 3 iterations, 506637 ms
      [info]   Running case: infer timestamps from Dataset[String]
      [info]   Stopped after 3 iterations, 509076 ms
      

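      This is roughly a 10x performance regression: with the schema given, parsing goes from 44764 ms to 506637 ms (about 11x), and with schema inference from 93764 ms to 509076 ms (about 5x).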
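
      The fallback path described above can be illustrated with a minimal sketch. This is not Spark's actual implementation: `FallbackSketch`, the strict pattern, and the use of `LocalDateTime.parse` as a stand-in for the legacy parser are all assumptions made for illustration. It only shows why every value that fails the first parser pays for a thrown exception plus a second parse.

```scala
import java.time.LocalDateTime
import java.time.format.{DateTimeFormatter, DateTimeParseException}

// Hypothetical illustration only; names and pattern are not taken from Spark.
object FallbackSketch {
  // Assumed strict pattern: "SSS" requires exactly three fraction digits,
  // so inputs such as "...03.7" or "...03.42" are rejected.
  private val strict: DateTimeFormatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS")

  // Counts how many inputs took the slow (fallback) path.
  var fallbacks: Long = 0L

  def parse(s: String): LocalDateTime =
    try {
      LocalDateTime.parse(s, strict) // fast path: new-style strict parser
    } catch {
      case _: DateTimeParseException =>
        // Slow path: an exception was constructed and thrown, and the
        // string is now parsed a second time by a more lenient parser.
        // ISO_LOCAL_DATE_TIME stands in for the legacy parser here.
        fallbacks += 1
        LocalDateTime.parse(s)
    }
}
```

      With the modified benchmark's values (a fraction of one or two digits), every row would take the slow path under the assumed pattern, so the per-row cost includes exception handling on top of two parse attempts.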

          People

            Qin Yao Kent Yao 2
            Qin Yao Kent Yao 2