Uploaded image for project: 'Parquet'
  1. Parquet
  2. PARQUET-139

Avoid reading file footers in parquet-avro InputFormat

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.0
    • Component/s: None
    • Labels:
      None

      Description

      The AvroParquetInputFormat currently relies on the ParquetInputFormat that reads the footers for all of the files that will be processed. This is for two reasons:
      1. To plan splits (if using client side splits)
      2. To get a merged schema for all of the files

      Reading all of the footers is a bottle-neck when working with a large number of files and can significantly delay a job because only one machine is working. This should be done in parallel on the task side. PARQUET-84 added the ability to avoid reading footers on the client for split planning, so the difficult task is to avoid reading footers to merge the Parquet schema.

      To avoid merging the Parquet schema, the AvroParquetInputFormat should either use whatever schema a file contains or should reconcile the projection schema with the file schema on the task side.

        Attachments

          Activity

            People

            • Assignee:
              rdblue Ryan Blue
              Reporter:
              rdblue Ryan Blue
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: