Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
When testing Spark SQL's Parquet support, we found that accessing large Parquet files stored in S3 can be very slow. To be more specific, we have an S3 Parquet file with over 3,000 part-files, and calling ParquetInputFormat.getSplits on it takes several minutes. (We were accessing this file from our office network rather than from within AWS.)
After some investigation, we found that ParquetInputFormat.getSplits calls getFileStatus() on every part-file one by one, sequentially (here). In the case of S3, each getFileStatus() call issues an HTTP request and blocks waiting for the reply, which is considerably expensive.
In fact, all of these FileStatus objects have already been fetched by the time the footers are retrieved (here). Caching them greatly improves our S3 case: getSplits time dropped from over 5 minutes to about 1.4 minutes.
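The caching idea can be sketched as follows. This is a minimal, self-contained illustration, not the actual patch: FileStatus, getFileStatus(), and the cache below are simplified stand-ins for the Hadoop FileSystem API, with a counter simulating the per-call HTTP round-trips.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical sketch of the caching idea: getFileStatus() stands in for the
// expensive per-file S3 metadata call; the cache is populated during the
// first pass (as footer retrieval already does), so a later getSplits-style
// pass reads from it instead of issuing one blocking request per part-file.
object FileStatusCacheSketch {
  final case class FileStatus(path: String, length: Long)

  var remoteCalls = 0 // counts simulated HTTP round-trips

  // Stand-in for FileSystem.getFileStatus: one blocking request per call.
  def getFileStatus(path: String): FileStatus = {
    remoteCalls += 1
    FileStatus(path, length = 1024L)
  }

  // Cache keyed by file path, filled on first lookup.
  private val statusCache = TrieMap.empty[String, FileStatus]

  def cachedStatus(path: String): FileStatus =
    statusCache.getOrElseUpdate(path, getFileStatus(path))

  def main(args: Array[String]): Unit = {
    val paths = (1 to 3000).map(i => s"s3://bucket/table/part-$i.parquet")
    paths.foreach(cachedStatus) // footer-retrieval pass: 3,000 remote calls
    paths.foreach(cachedStatus) // getSplits pass: served entirely from cache
    println(remoteCalls)        // 3000, not 6000
  }
}
```

Without the cache, the second pass would repeat all 3,000 blocking requests; with it, the per-file metadata is fetched exactly once.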
Will submit a PR for this issue soon.
Issue Links
- is related to:
  - SPARK-2551 Cleanup FilteringParquetRowInputFormat (Resolved)
  - SPARK-2119 Reading Parquet InputSplits dominates query execution time when reading off S3 (Resolved)
- relates to:
  - PARQUET-4 Use LRU caching for footers in ParquetInputFormat. (Resolved)