Description
I have a table with _many_ partitions (30K). Users cannot query all of them, but they are all in the metastore. Querying this table is extremely slow, even when we ask for a single partition.
"describe sometable" also performs very poorly.
Spark produces the following times:
Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 72.831, Reading results: 0.189
Whereas Hive over the same metastore shows:
Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 0.204, Reading results: 0.236
I attempted to debug this and noticed that HiveMetastoreCatalog constructs an object for each partition, which is puzzling to me (attaching screenshot). Should this value be lazy? IMO "describe table" should be purely a metastore operation (i.e. query postgres, return types).
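The eager-versus-lazy distinction raised above can be sketched as follows. This is a toy Python illustration, not Spark code: `FakeMetastore`, `EagerTable`, and `LazyTable` are hypothetical names invented for this sketch, and the per-partition cost is modeled only as a counter. It assumes the slowdown comes from building one object per partition when the table is looked up, as the debugging session suggests.

```python
from functools import cached_property


class FakeMetastore:
    """Hypothetical stand-in for a metastore; counts partition objects built."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.partitions_built = 0

    def schema(self):
        # Cheap metadata lookup: column names and types only.
        return [("id", "int"), ("ds", "string")]

    def load_partitions(self):
        # Expensive path: one object per partition.
        parts = []
        for i in range(self.num_partitions):
            self.partitions_built += 1
            parts.append({"ds": f"p{i}"})
        return parts


class EagerTable:
    """Builds every partition object as soon as the table is looked up."""

    def __init__(self, metastore):
        self.metastore = metastore
        self.partitions = metastore.load_partitions()

    def describe(self):
        return self.metastore.schema()


class LazyTable:
    """Defers partition loading until something actually reads it."""

    def __init__(self, metastore):
        self.metastore = metastore

    @cached_property
    def partitions(self):
        # Runs on first access only; the result is cached afterwards.
        return self.metastore.load_partitions()

    def describe(self):
        # Never touches partitions, so it stays a schema-only operation.
        return self.metastore.schema()
```

With the lazy variant, `describe()` returns the schema without building any partition objects; the 30K objects are created only if `partitions` is read. This addresses the "describe table" case only. Queries that need partitions would still pay the full cost unless the partition filter is also pushed to the metastore.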
The issue is a blocker for me, but I am leaving it at the default priority until someone can confirm it is a bug. "describe table" by itself is not so interesting, but I think this affects all query paths. I sent an inquiry about this earlier here: https://www.mail-archive.com/user@spark.apache.org/msg26242.html
Attachments
Issue Links
- duplicates SPARK-6910 Support for pushing predicates down to metastore for partition pruning (Resolved)