[SPARK-25306] Avoid skewed filter trees to speed up `createFilter` in ORC - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.6.3, 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0
Fix Version/s: 2.4.0
Component/s: SQL
Labels:
None

Description

In both ORC data sources, createFilter function has exponential time complexity due to its skewed filter tree generation. This PR aims to improve it by using new buildTree function.

REPRODUCE

// Create and read 1 row table with 1000 columns
sql("set spark.sql.orc.filterPushdown=true")
val selectExpr = (1 to 1000).map(i => s"id c$i")
spark.range(1).selectExpr(selectExpr: _*).write.mode("overwrite").orc("/tmp/orc")
print(s"With 0 filters, ")
spark.time(spark.read.orc("/tmp/orc").count)

// Increase the number of filters
(20 to 30).foreach { width =>
  val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ")
  print(s"With $width filters, ")
  spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count)
}

RESULT

With 0 filters, Time taken: 653 ms                                              
With 20 filters, Time taken: 962 ms
With 21 filters, Time taken: 1282 ms
With 22 filters, Time taken: 1982 ms
With 23 filters, Time taken: 3855 ms
With 24 filters, Time taken: 6719 ms
With 25 filters, Time taken: 12669 ms
With 26 filters, Time taken: 25032 ms
With 27 filters, Time taken: 49585 ms
With 28 filters, Time taken: 98980 ms    // over 1 min 38 seconds
With 29 filters, Time taken: 198368 ms   // over 3 mins
With 30 filters, Time taken: 393744 ms   // over 6 mins

Attachments

Issue Links

blocks

SPARK-20901 Feature parity for ORC with Parquet

Open

links to

[Github] Pull Request #22313 (dongjoon-hyun)

[Github] Pull Request #22336 (dongjoon-hyun)

Activity

People

Assignee:: Dongjoon Hyun

Reporter:: Dongjoon Hyun

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 01/Sep/18 22:21

Updated:: 05/Sep/18 05:17

Resolved:: 05/Sep/18 02:24