Parquet / PARQUET-460

Parquet files concat tool


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.0, 1.8.0
    • Fix Version/s: 1.9.0
    • Component/s: parquet-mr
    • Labels:
      None

      Description

      Currently, Parquet file generation is time-consuming; most of the time is spent on serialization and compression. It takes about 10 minutes to generate a ~100 MB Parquet file in our scenario. We want to improve write performance without generating too many small files, which would hurt read performance.

      We propose to:
      1. generate several small Parquet files concurrently
      2. merge the small files into one file: concatenate the Parquet blocks in binary form (without SerDe), merge the footers, and update the path and offset metadata.
      We created the ParquetFilesConcat class to perform step 2. It can be invoked via parquet.tools.command.ConcatCommand. If this feature is approved by the Parquet community, we will integrate it into Spark.

      This will impact compression and introduce more dictionary pages, but that can be mitigated by tuning the concurrency of step 1.
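      The core idea of step 2 is that merging copies the already-serialized, already-compressed block bytes verbatim and only rewrites the footer metadata with new offsets. Below is a minimal, self-contained Python sketch of that idea over a hypothetical toy file format (this is NOT the real Parquet layout, and the function names are illustrative, not part of parquet-mr):

      ```python
      import struct

      # Toy format used only to illustrate the merge idea:
      #   [block bytes ...][footer][4-byte little-endian footer length]
      # where the footer is a list of (offset, length) pairs, one per block.

      def write_file(blocks):
          """Serialize blocks and append a footer recording each block's offset."""
          data = bytearray()
          entries = []
          for block in blocks:
              entries.append((len(data), len(block)))
              data += block
          footer = b"".join(struct.pack("<II", off, ln) for off, ln in entries)
          return bytes(data) + footer + struct.pack("<I", len(footer))

      def read_footer(buf):
          """Return the list of (offset, length) block entries from the footer."""
          (flen,) = struct.unpack("<I", buf[-4:])
          footer = buf[-4 - flen:-4]
          return [struct.unpack("<II", footer[i:i + 8]) for i in range(0, flen, 8)]

      def concat_files(files):
          """Merge files by copying raw block bytes (no deserialization or
          recompression) and writing one merged footer with updated offsets."""
          blocks = []
          for buf in files:
              for off, ln in read_footer(buf):
                  blocks.append(buf[off:off + ln])  # raw byte copy, no SerDe
          return write_file(blocks)
      ```

      The real tool does the analogous work for Parquet: it appends each input file's row groups as raw bytes and rewrites the merged footer so every column chunk's file path and offset point into the new file.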


            People

            • Assignee: flykobe cheng
            • Reporter: flykobe cheng
            • Votes: 1
            • Watchers: 6
