Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Won't Fix
-
2.3.0
-
None
-
None
Description
Spark currently pushes the predicates it has in the SQL query to Hive Metastore. This only applies to predicates that are placed on top of partitioning columns. As more and more hive metastore implementations come around, this is an important optimization to allow data to be prefiltered to only relevant partitions. Consider the following example:
Table:
create external table data (key string, quantity long)
partitioned by (processing-date timestamp)
Query:
select * from data where processing-date = '2017-10-23 00:00:00'
Currently, no filters will be pushed to the hive metastore for the above query. The reason is that the code that tries to compute predicates to be sent to hive metastore, only deals with integral and string column types. It doesn't know how to handle fractional and timestamp columns.
I have tables in my metastore (AWS Glue) with millions of partitions of type timestamp and double. In my specific case, it takes Spark's master node about 6.5 minutes to download all partitions for the table, and then filter the partitions client-side. The actual processing time of my query is only 6 seconds. In other words, without partition pruning, I'm looking at 6.5 minutes of processing and with partition pruning, I'm looking at 6 seconds only.
I have a fix for this developed locally that I'll provide shortly as a pull request.
Attachments
Issue Links
- Is contained by
-
SPARK-23443 Spark with Glue as external catalog
- Open
- links to