Description
For large Parquet tables (e.g., tables with thousands of partitions), discovering Parquet metadata for schema merging and split generation can be very slow. We need to accelerate this process. One possible solution is to do the discovery via a distributed Spark job, as sketched below.
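A minimal sketch of what distributed footer discovery could look like, covering only the schema-merging half of the problem. The object and method names (`DistributedFooterDiscovery`, `mergedSchemaString`) and the fixed `parallelism` default are hypothetical, not part of Spark's API; the sketch assumes the table root is directly listable from the driver and that footers can be read with default Hadoop configuration on the executors.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.schema.MessageTypeParser
import org.apache.spark.sql.SparkSession

object DistributedFooterDiscovery {
  // Hypothetical helper: returns the merged Parquet schema of all files under
  // `tableRoot` as a schema string, or None if no Parquet files are found.
  def mergedSchemaString(
      spark: SparkSession,
      tableRoot: String,
      parallelism: Int = 200): Option[String] = {

    // 1. Enumerate leaf Parquet files on the driver; listing is cheap
    //    compared with reading every footer.
    val conf = spark.sparkContext.hadoopConfiguration
    val root = new Path(tableRoot)
    val fs = root.getFileSystem(conf)
    val files = scala.collection.mutable.ArrayBuffer.empty[String]
    val it = fs.listFiles(root, /* recursive = */ true)
    while (it.hasNext) {
      val status = it.next()
      if (status.getPath.getName.endsWith(".parquet")) files += status.getPath.toString
    }

    // 2. Read footers in a distributed job. Schemas are shipped back as
    //    strings because MessageType is not serializable.
    val schemaStrings = spark.sparkContext
      .parallelize(files, math.max(1, math.min(parallelism, files.size)))
      .map { p =>
        // Assumption: default Configuration suffices on executors
        // (no special filesystem credentials needed).
        val hadoopConf = new Configuration()
        val footer = ParquetFileReader.readFooter(
          hadoopConf, new Path(p), ParquetMetadataConverter.SKIP_ROW_GROUPS)
        footer.getFileMetaData.getSchema.toString
      }
      .distinct()
      .collect()

    // 3. Merge the (usually few) distinct schemas on the driver.
    schemaStrings
      .map(MessageTypeParser.parseMessageType)
      .reduceOption(_ union _)
      .map(_.toString)
  }
}
```

The key design point is that only the file listing and the final union of distinct schemas stay on the driver; the per-file footer reads, which dominate the cost for tables with thousands of files, are spread across executors.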
Issue Links
- is duplicated by SPARK-9347: spark load of existing parquet files extremely slow if large number of files (Resolved)
- is related to SPARK-6795: Avoid reading Parquet footers on driver side when a global arbitrative schema is available (Resolved)
- relates to SPARK-9072: Parquet: Writing data to S3 very slowly (Resolved)