Description
I found an issue where filtering partitions of a Hive table through the DataFrame API can result in very poor performance.
To reproduce:
Create a Hive table with a large number of partitions and write a Spark query on that table that filters on the partition column.
In my use case I've got a table with about 30k partitions.
I filter the partitions using some Scala via spark-shell:
table.filter("partition=x or partition=y")
This results in the Hive Thrift API call {{#get_partitions('db', 'table', -1)}}, which is very slow (minutes) and loads all of the table's partition metadata from the metastore into memory.
Doing a simpler filter:
table.filter("partition=x")
results in the Hive Thrift API call {{#get_partitions_by_filter('db', 'table', 'partition = "x"', -1)}}, which is very fast (seconds) and only fetches partition x into memory.
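For reference, a fuller spark-shell sketch of both cases. The database, table, and partition column names below are placeholders, not my actual ones:
{code:scala}
// Minimal repro sketch. Names are placeholders: database my_db, table events,
// partition column part. Assumes spark-shell with Hive support enabled.
spark.sql("CREATE DATABASE IF NOT EXISTS my_db")
spark.sql(
  """CREATE TABLE IF NOT EXISTS my_db.events (value STRING)
    |PARTITIONED BY (part STRING)
    |STORED AS PARQUET""".stripMargin)

val table = spark.table("my_db.events")

// OR across two partition values: observed to trigger #get_partitions,
// which loads every partition into memory (minutes with ~30k partitions).
table.filter("part = 'x' or part = 'y'").count()

// Single equality: observed to trigger #get_partitions_by_filter,
// which fetches only the matching partition (seconds).
table.filter("part = 'x'").count()
{code}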
If possible, Spark should translate such filters into the more performant Thrift call, or fall back to a more scalable solution that filters out partitions without having to load them all into memory first (for instance, fetching the partitions in batches).
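As a possible interim workaround, the OR predicate can be split into single-value equality filters whose results are unioned, so each metastore lookup stays on the fast {{#get_partitions_by_filter}} path. This is only a sketch using the placeholder names from above; I haven't verified it across Spark versions:
{code:scala}
// Workaround sketch (placeholder names, not verified on every Spark version):
// issue one equality filter per wanted partition value so each lookup can be
// pushed to the metastore, then union the results.
val wanted = Seq("x", "y")

val pruned = wanted
  .map(v => table.filter(s"part = '$v'"))
  .reduce(_ union _)

pruned.count()
{code}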
I've posted my original question on Stack Overflow.
Issue Links
- duplicates SPARK-20331 Broaden support for Hive partition pruning predicate pushdown (Resolved)