Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
When testing Spark SQL's Parquet support, we found that accessing large Parquet files stored in S3 can be very slow. To be more specific, we have an S3 Parquet file with over 3,000 part-files, and calling ParquetInputFormat.getSplits on it takes several minutes. (We were accessing this file from our office network rather than from within AWS.)
After some investigation, we found that ParquetInputFormat.getSplits calls getFileStatus() on every part-file one by one, sequentially (here). In the case of S3, each getFileStatus() call issues an HTTP request and blocks waiting for the reply, which is considerably expensive.
In fact, all of these FileStatus objects have already been fetched by the time the footers are retrieved (here). Caching them greatly improves our S3 case: getSplits time dropped from over 5 minutes to about 1.4 minutes.
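The caching idea can be sketched as follows. This is a minimal, self-contained illustration, not the actual patch: FileStatus, getFileStatus(), and the cache below are simplified stand-ins for the Hadoop FileSystem API, with a counter simulating the per-call HTTP round-trips.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical sketch of the caching idea: getFileStatus() stands in for the
// expensive per-file S3 metadata call; the cache is populated during the
// first pass (as footer retrieval already does), so a later getSplits-style
// pass reads from it instead of issuing one blocking request per part-file.
object FileStatusCacheSketch {
  final case class FileStatus(path: String, length: Long)

  var remoteCalls = 0 // counts simulated HTTP round-trips

  // Stand-in for FileSystem.getFileStatus: one blocking request per call.
  def getFileStatus(path: String): FileStatus = {
    remoteCalls += 1
    FileStatus(path, length = 1024L)
  }

  // Cache keyed by file path, filled on first lookup.
  private val statusCache = TrieMap.empty[String, FileStatus]

  def cachedStatus(path: String): FileStatus =
    statusCache.getOrElseUpdate(path, getFileStatus(path))

  def main(args: Array[String]): Unit = {
    val paths = (1 to 3000).map(i => s"s3://bucket/table/part-$i.parquet")
    paths.foreach(cachedStatus) // footer-retrieval pass: 3,000 remote calls
    paths.foreach(cachedStatus) // getSplits pass: served entirely from cache
    println(remoteCalls)        // 3000, not 6000
  }
}
```

Without the cache, the second pass would repeat all 3,000 blocking requests; with it, the per-file metadata is fetched exactly once.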
Will submit a PR for this issue soon.
Issue Links
- is related to:
  - SPARK-2551 Cleanup FilteringParquetRowInputFormat (Resolved)
  - SPARK-2119 Reading Parquet InputSplits dominates query execution time when reading off S3 (Resolved)
- relates to:
  - PARQUET-4 Use LRU caching for footers in ParquetInputFormat. (Resolved)