Uploaded image for project: 'Parquet'
  1. Parquet
  2. PARQUET-139

Avoid reading file footers in parquet-avro InputFormat

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 1.6.0
    • None
    • None

    Description

      The AvroParquetInputFormat currently relies on the ParquetInputFormat that reads the footers for all of the files that will be processed. This is for two reasons:
      1. To plan splits (if using client side splits)
      2. To get a merged schema for all of the files

      Reading all of the footers is a bottle-neck when working with a large number of files and can significantly delay a job because only one machine is working. This should be done in parallel on the task side. PARQUET-84 added the ability to avoid reading footers on the client for split planning, so the difficult task is to avoid reading footers to merge the Parquet schema.

      To avoid merging the Parquet schema, the AvroParquetInputFormat should either use whatever schema a file contains or should reconcile the projection schema with the file schema on the task side.

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            rdblue Ryan Blue
            rdblue Ryan Blue
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment