Details
Description
In both ORC data sources, createFilter function has exponential time complexity due to its skewed filter tree generation. This PR aims to improve it by using new buildTree function.
REPRODUCE
// Create and read 1 row table with 1000 columns sql("set spark.sql.orc.filterPushdown=true") val selectExpr = (1 to 1000).map(i => s"id c$i") spark.range(1).selectExpr(selectExpr: _*).write.mode("overwrite").orc("/tmp/orc") print(s"With 0 filters, ") spark.time(spark.read.orc("/tmp/orc").count) // Increase the number of filters (20 to 30).foreach { width => val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ") print(s"With $width filters, ") spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count) }
RESULT
With 0 filters, Time taken: 653 ms With 20 filters, Time taken: 962 ms With 21 filters, Time taken: 1282 ms With 22 filters, Time taken: 1982 ms With 23 filters, Time taken: 3855 ms With 24 filters, Time taken: 6719 ms With 25 filters, Time taken: 12669 ms With 26 filters, Time taken: 25032 ms With 27 filters, Time taken: 49585 ms With 28 filters, Time taken: 98980 ms // over 1 min 38 seconds With 29 filters, Time taken: 198368 ms // over 3 mins With 30 filters, Time taken: 393744 ms // over 6 mins
Attachments
Issue Links
- blocks
-
SPARK-20901 Feature parity for ORC with Parquet
- Open
- links to