Description
For large Parquet tables (e.g., tables with thousands of partitions), discovering Parquet metadata for schema merging and split generation can be very slow. We need to accelerate this process. One possible solution is to do the discovery via a distributed Spark job, as sketched below.
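A minimal sketch of what distributed footer discovery could look like, covering only the schema-merging half of the problem. The object and method names (`DistributedFooterDiscovery`, `mergedSchemaString`) and the fixed `parallelism` default are hypothetical, not part of Spark's API; the sketch assumes the table root is directly listable from the driver and that footers can be read with default Hadoop configuration on the executors.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.schema.MessageTypeParser
import org.apache.spark.sql.SparkSession

object DistributedFooterDiscovery {
  // Hypothetical helper: returns the merged Parquet schema of all files under
  // `tableRoot` as a schema string, or None if no Parquet files are found.
  def mergedSchemaString(
      spark: SparkSession,
      tableRoot: String,
      parallelism: Int = 200): Option[String] = {

    // 1. Enumerate leaf Parquet files on the driver; listing is cheap
    //    compared with reading every footer.
    val conf = spark.sparkContext.hadoopConfiguration
    val root = new Path(tableRoot)
    val fs = root.getFileSystem(conf)
    val files = scala.collection.mutable.ArrayBuffer.empty[String]
    val it = fs.listFiles(root, /* recursive = */ true)
    while (it.hasNext) {
      val status = it.next()
      if (status.getPath.getName.endsWith(".parquet")) files += status.getPath.toString
    }

    // 2. Read footers in a distributed job. Schemas are shipped back as
    //    strings because MessageType is not serializable.
    val schemaStrings = spark.sparkContext
      .parallelize(files, math.max(1, math.min(parallelism, files.size)))
      .map { p =>
        // Assumption: default Configuration suffices on executors
        // (no special filesystem credentials needed).
        val hadoopConf = new Configuration()
        val footer = ParquetFileReader.readFooter(
          hadoopConf, new Path(p), ParquetMetadataConverter.SKIP_ROW_GROUPS)
        footer.getFileMetaData.getSchema.toString
      }
      .distinct()
      .collect()

    // 3. Merge the (usually few) distinct schemas on the driver.
    schemaStrings
      .map(MessageTypeParser.parseMessageType)
      .reduceOption(_ union _)
      .map(_.toString)
  }
}
```

The key design point is that only the file listing and the final union of distinct schemas stay on the driver; the per-file footer reads, which dominate the cost for tables with thousands of files, are spread across executors.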
Issue Links
- is duplicated by SPARK-9347: spark load of existing parquet files extremely slow if large number of files (Resolved)
- is related to SPARK-6795: Avoid reading Parquet footers on driver side when a global arbitrative schema is available (Resolved)
- relates to SPARK-9072: Parquet: Writing data to S3 very slowly (Resolved)