SPARK-18199: Support appending to Parquet files


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Currently, appending to a Parquet directory simply creates new Parquet files in the directory. With many small appends (for example, in a streaming job with a short batch duration), this leads to an unbounded number of small Parquet files accumulating. These must periodically be cleaned up by removing them all and rewriting a single file containing all the rows.
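
      A minimal Scala sketch of the current pattern and the periodic compaction it forces is below; the path and sample data are placeholders, and only standard SparkSession / DataFrameWriter calls are used.

      {code:scala}
      import org.apache.spark.sql.{SaveMode, SparkSession}

      val spark = SparkSession.builder()
        .appName("parquet-append-workaround")
        .getOrCreate()

      // Hypothetical output directory that a streaming job appends to.
      val path = "hdfs:///data/events"

      // Each micro-batch append adds one or more new small files to the directory.
      val batch = spark.range(0, 1000).toDF("id")
      batch.write.mode(SaveMode.Append).parquet(path)

      // Periodic compaction: read everything back, rewrite it as a single file
      // elsewhere, then swap the directories (e.g. via the Hadoop FileSystem API).
      spark.read.parquet(path)
        .coalesce(1)
        .write.mode(SaveMode.Overwrite)
        .parquet(path + "_compacted")
      {code}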

      It would be far better if Spark supported appending to the Parquet files themselves. HDFS supports this, as does Parquet:

      • The Parquet footer can be read in order to obtain necessary metadata.
      • The new rows can then be appended to the Parquet file as a row group.
      • A new footer can then be appended containing the metadata and referencing the new row groups as well as the previously existing row groups.

      This would result in a small amount of bloat in the file as new row groups are added (since duplicate metadata would accumulate), but it is hugely preferable to accumulating small files, which is bad for HDFS health and eventually leaves Spark unable to read the Parquet directory at all. Periodic rewriting of the file could still be performed in order to remove the duplicate metadata.
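
      Spark does not expose this today, but parquet-mr's ParquetFileWriter already supports row-group-level copying (appendFile copies row groups byte for byte, and the single footer written at the end references all of them), which covers the footer/row-group mechanics described above, although as a rewrite into a new file rather than an in-place HDFS append. A rough Scala sketch follows; the file paths are placeholders, the inputs are assumed to share a schema, and the exact ParquetFileWriter constructor and overloads should be checked against the parquet-mr version in use.

      {code:scala}
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.Path
      import org.apache.parquet.hadoop.{ParquetFileReader, ParquetFileWriter}
      import org.apache.parquet.hadoop.util.{HadoopInputFile, HadoopOutputFile}

      val conf = new Configuration()

      // Hypothetical small files produced by repeated appends, plus a merge target.
      val inputs = Seq(
        new Path("hdfs:///data/events/part-00000.parquet"),
        new Path("hdfs:///data/events/part-00001.parquet"))
      val output = new Path("hdfs:///data/events/merged.parquet")

      // Read one footer to obtain the schema the merged file will carry.
      val reader = ParquetFileReader.open(HadoopInputFile.fromPath(inputs.head, conf))
      val schema = reader.getFooter.getFileMetaData.getSchema
      reader.close()

      // Open a new file and copy each input's row groups into it without decoding
      // any pages; the footer written by end() references all of the row groups.
      val writer = new ParquetFileWriter(
        HadoopOutputFile.fromPath(output, conf), schema,
        ParquetFileWriter.Mode.CREATE, 128L * 1024 * 1024, 0)
      writer.start()
      inputs.foreach(in => writer.appendFile(HadoopInputFile.fromPath(in, conf)))
      writer.end(new java.util.HashMap[String, String]())
      {code}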

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: Jeremy Smith (jeremyrsmith)
            Votes: 3
            Watchers: 8

