[SPARK-6984] Operations on tables with many partitions very slow - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.2.1
Fix Version/s: 1.5.0
Component/s: SQL
Labels: None
Environment: External Hive metastore, table with 30K partitions

Description

I have a table with many partitions (30K). Users cannot query all of them, but they are all in the metastore. Querying this table is extremely slow even when we ask for a single partition.
"describe sometable" also performs very poorly.

Spark produces the following times:
Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 72.831, Reading results: 0.189

Whereas Hive over the same metastore shows:
Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 0.204, Reading results: 0.236

I attempted to debug this and noticed that HiveMetastoreCatalog constructs an object for each partition, which puzzles me (screenshot attached). Should this value be lazy? IMO "describe table" should be purely a metastore operation (i.e. query Postgres, return types).
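
To illustrate the question above, here is a minimal Python sketch of the difference between building per-partition objects eagerly and deferring that work until partitions are actually needed. This is not Spark's code: `FakeMetastore`, `Partition`, `EagerTable`, and `LazyTable` are hypothetical names invented for this example, and the fetch counter only stands in for the real cost of a metastore round trip.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """Hypothetical stand-in for a per-partition metadata object."""
    spec: str


class FakeMetastore:
    """Hypothetical in-memory metastore that counts partition listings."""

    def __init__(self, schema, partition_specs):
        self.schema = schema
        self.partition_specs = partition_specs
        self.partition_fetches = 0

    def get_schema(self):
        return self.schema

    def get_partitions(self):
        # One object per partition: the expensive part with 30K partitions.
        self.partition_fetches += 1
        return [Partition(spec) for spec in self.partition_specs]


class EagerTable:
    """Builds every partition object as soon as the table is looked up."""

    def __init__(self, metastore):
        self.schema = metastore.get_schema()
        self.partitions = metastore.get_partitions()


class LazyTable:
    """Defers the partition listing until something asks for it."""

    def __init__(self, metastore):
        self._metastore = metastore
        self.schema = metastore.get_schema()
        self._partitions = None

    @property
    def partitions(self):
        if self._partitions is None:
            self._partitions = self._metastore.get_partitions()
        return self._partitions
```

Under these assumptions, a schema-only operation such as "describe table" on a `LazyTable` never triggers the partition listing, while an `EagerTable` pays for it on every lookup.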

The issue is a blocker for me, but I am leaving it at the default priority until someone can confirm it is a bug. "describe table" itself is not that interesting, but I think this affects all query paths. I sent an inquiry earlier here: https://www.mail-archive.com/user@spark.apache.org/msg26242.html

Attachments

7282_partitions_stack.png (234 kB), attached by Yana Kadiyska, 17/Apr/15 16:26

Issue Links

duplicates: SPARK-6910 Support for pushing predicates down to metastore for partition pruning (Resolved)

People

Assignee: Unassigned

Reporter: Yana Kadiyska

Votes: 0

Watchers: 4

Dates

Created: 17/Apr/15 16:24

Updated: 28/Jul/15 05:37

Resolved: 28/Jul/15 05:37