Description
I found an issue where filtering partitions of a Hive table through the DataFrame API can result in very poor performance.
To reproduce:
Create a Hive table with a large number of partitions and write a Spark query on that table that filters on the partition column.
In my use case I've got a table with about 30k partitions.
I filter the partitions using some Scala via spark-shell:
table.filter("partition=x or partition=y")
This results in the Hive Thrift API call {{#get_partitions('db', 'table', -1)}}, which is very slow (minutes) and loads all of the table's partition metadata from the metastore into memory.
Doing a simpler filter:
table.filter("partition=x")
results in the Hive Thrift API call {{#get_partitions_by_filter('db', 'table', 'partition = "x"', -1)}}, which is very fast (seconds) and only fetches partition x into memory.
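For reference, a fuller spark-shell sketch of both cases. The database, table, and partition column names below are placeholders, not my actual ones:
{code:scala}
// Minimal repro sketch. Names are placeholders: database my_db, table events,
// partition column part. Assumes spark-shell with Hive support enabled.
spark.sql("CREATE DATABASE IF NOT EXISTS my_db")
spark.sql(
  """CREATE TABLE IF NOT EXISTS my_db.events (value STRING)
    |PARTITIONED BY (part STRING)
    |STORED AS PARQUET""".stripMargin)

val table = spark.table("my_db.events")

// OR across two partition values: observed to trigger #get_partitions,
// which loads every partition into memory (minutes with ~30k partitions).
table.filter("part = 'x' or part = 'y'").count()

// Single equality: observed to trigger #get_partitions_by_filter,
// which fetches only the matching partition (seconds).
table.filter("part = 'x'").count()
{code}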
If possible, Spark should translate such filters into the more performant Thrift call, or fall back to a more scalable solution that filters out partitions without having to load them all into memory first (for instance, fetching the partitions in batches).
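As a possible interim workaround, the OR predicate can be split into single-value equality filters whose results are unioned, so each metastore lookup stays on the fast {{#get_partitions_by_filter}} path. This is only a sketch using the placeholder names from above; I haven't verified it across Spark versions:
{code:scala}
// Workaround sketch (placeholder names, not verified on every Spark version):
// issue one equality filter per wanted partition value so each lookup can be
// pushed to the metastore, then union the results.
val wanted = Seq("x", "y")

val pruned = wanted
  .map(v => table.filter(s"part = '$v'"))
  .reduce(_ union _)

pruned.count()
{code}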
I've posted my original question on Stack Overflow.
Issue Links
- duplicates SPARK-20331 Broaden support for Hive partition pruning predicate pushdown (Resolved)