SPARK-18199: Support appending to Parquet files


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Currently, appending to a Parquet directory simply creates new Parquet files in the directory. With many small appends (for example, in a streaming job with a short batch duration), this leads to an unbounded number of small Parquet files accumulating. These must periodically be cleaned up by removing them all and rewriting a single file containing all the rows.
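
      A minimal Scala sketch of the current pattern and the periodic compaction it forces is below; the path and sample data are placeholders, and only standard SparkSession / DataFrameWriter calls are used.

      {code:scala}
      import org.apache.spark.sql.{SaveMode, SparkSession}

      val spark = SparkSession.builder()
        .appName("parquet-append-workaround")
        .getOrCreate()

      // Hypothetical output directory that a streaming job appends to.
      val path = "hdfs:///data/events"

      // Each micro-batch append adds one or more new small files to the directory.
      val batch = spark.range(0, 1000).toDF("id")
      batch.write.mode(SaveMode.Append).parquet(path)

      // Periodic compaction: read everything back, rewrite it as a single file
      // elsewhere, then swap the directories (e.g. via the Hadoop FileSystem API).
      spark.read.parquet(path)
        .coalesce(1)
        .write.mode(SaveMode.Overwrite)
        .parquet(path + "_compacted")
      {code}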

      It would be far better if Spark supported appending to the Parquet files themselves. HDFS supports this, as does Parquet:

      • The Parquet footer can be read in order to obtain necessary metadata.
      • The new rows can then be appended to the Parquet file as a row group.
      • A new footer can then be appended containing the metadata and referencing the new row groups as well as the previously existing row groups.

      This would result in a small amount of bloat in the file as new row groups are added (since duplicate metadata would accumulate), but it is hugely preferable to accumulating small files, which is bad for HDFS health and eventually leaves Spark unable to read the Parquet directory at all. Periodic rewriting of the file could still be performed in order to remove the duplicate metadata.
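
      Spark does not expose this today, but parquet-mr's ParquetFileWriter already supports row-group-level copying (appendFile copies row groups byte for byte, and the single footer written at the end references all of them), which covers the footer/row-group mechanics described above, although as a rewrite into a new file rather than an in-place HDFS append. A rough Scala sketch follows; the file paths are placeholders, the inputs are assumed to share a schema, and the exact ParquetFileWriter constructor and overloads should be checked against the parquet-mr version in use.

      {code:scala}
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.Path
      import org.apache.parquet.hadoop.{ParquetFileReader, ParquetFileWriter}
      import org.apache.parquet.hadoop.util.{HadoopInputFile, HadoopOutputFile}

      val conf = new Configuration()

      // Hypothetical small files produced by repeated appends, plus a merge target.
      val inputs = Seq(
        new Path("hdfs:///data/events/part-00000.parquet"),
        new Path("hdfs:///data/events/part-00001.parquet"))
      val output = new Path("hdfs:///data/events/merged.parquet")

      // Read one footer to obtain the schema the merged file will carry.
      val reader = ParquetFileReader.open(HadoopInputFile.fromPath(inputs.head, conf))
      val schema = reader.getFooter.getFileMetaData.getSchema
      reader.close()

      // Open a new file and copy each input's row groups into it without decoding
      // any pages; the footer written by end() references all of the row groups.
      val writer = new ParquetFileWriter(
        HadoopOutputFile.fromPath(output, conf), schema,
        ParquetFileWriter.Mode.CREATE, 128L * 1024 * 1024, 0)
      writer.start()
      inputs.foreach(in => writer.appendFile(HadoopInputFile.fromPath(in, conf)))
      writer.end(new java.util.HashMap[String, String]())
      {code}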

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: Jeremy Smith (jeremyrsmith)
            Votes: 3
            Watchers: 8

