Description
With the original benchmark, where the timestamp values are valid for the new parser, the result is:
[info] Running benchmark: Read dates and timestamps
[info] Running case: timestamp strings
[info] Stopped after 3 iterations, 5781 ms
[info] Running case: parse timestamps from Dataset[String]
[info] Stopped after 3 iterations, 44764 ms
[info] Running case: infer timestamps from Dataset[String]
[info] Stopped after 3 iterations, 93764 ms
[info] Running case: from_json(timestamp)
[info] Stopped after 3 iterations, 59021 ms
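For reference, the original benchmark presumably generates values the new parser accepts directly, e.g. with a fixed-width fractional part. A hypothetical sketch (the actual generator in the benchmark may differ):

def timestampStr: Dataset[String] = {
  spark.range(0, rowsNum, 1, 1).mapPartitions { iter =>
    // "%03d" zero-pads the fraction to exactly three digits, e.g. ".005",
    // which a fixed-width fraction pattern can parse without any fallback
    iter.map(i => s"""{"timestamp":"1970-01-01T01:02:03.${"%03d".format(i % 100)}"}""")
  }.select($"value".as("timestamp")).as[String]
}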
When we modify the benchmark to:
def timestampStr: Dataset[String] = {
  spark.range(0, rowsNum, 1, 1).mapPartitions { iter =>
    iter.map(i => s"""{"timestamp":"1970-01-01T01:02:03.${i % 100}"}""")
  }.select($"value".as("timestamp")).as[String]
}

readBench.addCase("timestamp strings", numIters) { _ =>
  timestampStr.noop()
}

readBench.addCase("parse timestamps from Dataset[String]", numIters) { _ =>
  spark.read.schema(tsSchema).json(timestampStr).noop()
}

readBench.addCase("infer timestamps from Dataset[String]", numIters) { _ =>
  spark.read.json(timestampStr).noop()
}
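What presumably trips the new parser here is the variable-width fraction: i % 100 yields one or two digits (".5", ".42"), while a fixed-width fraction field expects exactly three. A minimal standalone illustration with java.time, assuming a fixed-width fraction pattern like SSS (Spark's actual default pattern may differ):

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS")

// Three-digit fraction matches the fixed-width SSS field and parses cleanly.
LocalDateTime.parse("1970-01-01T01:02:03.005", fmt)

// One-digit fraction throws DateTimeParseException: SSS demands exactly three digits.
LocalDateTime.parse("1970-01-01T01:02:03.5", fmt)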
Here the timestamp values are invalid for the new parser, which causes a per-record fallback to the legacy parser.
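The fallback itself is the expensive part: every record first fails in the new parser, and the thrown exception is caught before the legacy parser is tried. A rough sketch of that shape, with hypothetical parser names (not Spark's actual code):

import scala.util.control.NonFatal

// Hypothetical parser interface; the real Spark formatters have a richer API.
trait TimestampParser { def parse(s: String): Long }

def parseWithFallback(newParser: TimestampParser, legacyParser: TimestampParser)(s: String): Long =
  try {
    newParser.parse(s) // throws for every record in the modified benchmark...
  } catch {
    case NonFatal(_) =>
      legacyParser.parse(s) // ...so each row pays for an exception plus a second full parse
  }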
The result is:
[info] Running benchmark: Read dates and timestamps
[info] Running case: timestamp strings
[info] Stopped after 3 iterations, 5623 ms
[info] Running case: parse timestamps from Dataset[String]
[info] Stopped after 3 iterations, 506637 ms
[info] Running case: infer timestamps from Dataset[String]
[info] Stopped after 3 iterations, 509076 ms
This is roughly a 10x performance regression: parsing timestamps from Dataset[String] goes from 44764 ms to 506637 ms (about 11.3x slower), and schema inference goes from 93764 ms to 509076 ms (about 5.4x slower).