  1. Parquet
  2. PARQUET-16

Unnecessary getFileStatus() calls on all part-files in ParquetInputFormat.getSplits

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parquet-mr
    • Labels: None

    Description

      When testing Spark SQL Parquet support, we found that accessing large Parquet files located in S3 can be very slow. Specifically, we have an S3 Parquet file with over 3,000 part-files, and calling ParquetInputFormat.getSplits on it takes several minutes. (We were accessing this file from our office network rather than from within AWS.)

      After some investigation, we found that ParquetInputFormat.getSplits calls getFileStatus() on every part-file, one by one. On S3, each getFileStatus() call issues an HTTP request and blocks until the reply arrives, which is expensive.
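
      A rough cost model shows why sequential blocking lookups add up. The sketch below multiplies the number of part-files by a per-request round-trip time. The 100 ms round trip is a hypothetical figure chosen for illustration, not a measurement from this report, and the class and method names are made up for this example.

```java
import java.time.Duration;

public class SequentialLookupCost {
    // Total wall-clock time when each lookup must finish before the next one starts.
    static Duration sequentialCost(int partFiles, Duration roundTrip) {
        return roundTrip.multipliedBy(partFiles);
    }

    public static void main(String[] args) {
        // Assumption: 100 ms per HTTP round trip (illustrative only, not measured).
        Duration assumedRoundTrip = Duration.ofMillis(100);
        Duration total = sequentialCost(3000, assumedRoundTrip);
        System.out.println(total.getSeconds() + " s"); // prints "300 s"
    }
}
```

      Under that assumption, 3,000 sequential lookups alone would take about five minutes, so removing them is worth far more than speeding up any single call.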

      In fact, all of these FileStatus objects have already been fetched by the time the footers are retrieved. Caching them and reusing them in getSplits greatly improves our S3 case: the time dropped from over 5 minutes to about 1.4 minutes.
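
      The caching idea can be sketched without any Hadoop dependency. Everything below is a hypothetical stand-in: Status plays the role of FileStatus, CountingFs plays the role of a remote file system that counts round trips, and footerPass models the footer retrieval as one lookup per file, which is an assumption about how the statuses are obtained. This is not the parquet-mr implementation, only an illustration of reusing already-fetched statuses during split planning.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StatusCacheSketch {
    // Hypothetical stand-in for a file status; only the path is modeled.
    static final class Status {
        final String path;
        Status(String path) { this.path = path; }
    }

    // Hypothetical remote file system that counts how many lookups it serves.
    static final class CountingFs {
        int requests = 0;
        Status getFileStatus(String path) {
            requests++; // each call models one blocking HTTP request
            return new Status(path);
        }
    }

    // Footer retrieval: statuses are fetched here anyway, so remember them by path.
    static Map<String, Status> footerPass(CountingFs fs, List<String> paths) {
        Map<String, Status> seen = new HashMap<String, Status>();
        for (String p : paths) {
            seen.put(p, fs.getFileStatus(p));
        }
        return seen;
    }

    // Split planning: consult the cache first, hit the file system only on a miss.
    static int planSplits(CountingFs fs, List<String> paths, Map<String, Status> cache) {
        int planned = 0;
        for (String p : paths) {
            Status s = cache.get(p);
            if (s == null) {
                s = fs.getFileStatus(p);
            }
            if (s != null) {
                planned++;
            }
        }
        return planned;
    }

    // Runs both passes and reports the total number of simulated requests.
    static int totalRequests(int partFiles, boolean useCache) {
        List<String> paths = new ArrayList<String>();
        for (int i = 0; i < partFiles; i++) {
            paths.add("part-" + i);
        }
        CountingFs fs = new CountingFs();
        Map<String, Status> cache = footerPass(fs, paths);
        planSplits(fs, paths, useCache ? cache : new HashMap<String, Status>());
        return fs.requests;
    }

    public static void main(String[] args) {
        System.out.println(totalRequests(3000, false)); // prints 6000
        System.out.println(totalRequests(3000, true));  // prints 3000
    }
}
```

      In this model the cache removes every lookup from the planning pass, so the request count is halved. The real saving depends on how the footer pass obtains its statuses.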

      Will submit a PR for this issue soon.

            People

              Assignee: Unassigned
              Reporter: Cheng Lian
              Votes: 0
              Watchers: 4
