[SPARK-16980] Load only catalog table partition metadata required to answer a query - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 2.1.0
Component/s: SQL
Labels:
None

Target Version/s:

2.1.0

Description

Currently, when a user reads from a partitioned Hive table whose metadata are not cached (and for which Hive table conversion is enabled and supported), all partition metadata are fetched from the metastore:

https://github.com/apache/spark/blob/5effc016c893ce917d535cc1b5026d8e4c846721/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L252-L260

However, if the user's query includes partition pruning predicates then we only need the subset of these metadata which satisfy those predicates.

This issue tracks work to modify the current query planning scheme so that unnecessary partition metadata are not loaded.

I've prototyped two possible approaches. The first extends o.a.s.s.c.catalog.ExternalCatalog and as such is more generally applicable. It requires some new abstractions and refactoring of HadoopFsRelation and FileCatalog, among others. It places a greater burden on other implementations of ExternalCatalog. Currently the only other implementation of ExternalCatalog is InMemoryCatalog, and my prototype throws an UnsupportedOperationException on that implementation.

The second prototype is simpler and only touches code in the hive project. Basically, conversion of a partitioned MetastoreRelation to HadoopFsRelation is deferred to physical planning. During physical planning, the partition pruning filters in the query plan are used to identify the required partition metadata and a HadoopFsRelation is built from those. The new query plan is then re-injected into the physical planner and proceeds as normal for a HadoopFsRelation.

On the Spark dev mailing list, ekhliang expressed a preference for the approach I took in my first POC. (See http://apache-spark-developers-list.1001551.n3.nabble.com/Scaling-partitioned-Hive-table-support-td18586.html) Based on that, I'm going to open a PR with that patch as a starting point for an architectural/design review. It will not be a complete patch ready for integration into Spark master. Rather, I would like to get early feedback on the implementation details so I can shape the PR before committing a large amount of time on a finished product. I will open another PR for the second approach for comparison if requested.

Attachments

Issue Links

is depended upon by

HADOOP-13525 Optimize uses of FS operations in the ASF analysis frameworks and libraries

Resolved

is duplicated by

SPARK-17179 Consider improving partition pruning in HiveMetastoreCatalog

Closed

is related to

SPARK-17861 Store data source partitions in metastore and push partition pruning into metastore

Resolved

links to

[Github] Pull Request #14690 (mallman)

Activity

People

Assignee:: Michael MacFadden

Reporter:: Michael MacFadden

Votes:: 6 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 09/Aug/16 18:15

Updated:: 02/Nov/16 13:52

Resolved:: 15/Oct/16 01:25