Parquet / PARQUET-1289

Spec for Updateable Parquet

Details

    • Type: Wish
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parquet-format
    • Labels: None

    Description

      Parquet today is a read-only columnar format. Could we also make it updatable by using the row-filtering methods in Apache Arrow?

      Here's how it would work:

      A. Add an insert timestamp to every record in a Parquet file.
      B. Add a modifiable list of row offsets to the Parquet file's footer, identifying the records in that file that have been logically deleted. Each offset should also carry a delete timestamp, so that a snapshot of the data as it looked at any point in time can be reproduced.
      C. If a Parquet record is ever updated, the updated record would be written as a new record in a different Parquet file, and the old record would be logically deleted by adding its row offset to the footer of the file that contains it. We would need a service that does this.
      D. When reading Parquet files, logically deleted rows would be excluded.
      E. Alternatively, when reading Parquet files with a snapshot time, any row with an insert timestamp > snapshot time would be excluded, and any row flagged as logically deleted would still be included if its delete timestamp > snapshot time, that is, if it was deleted only after the snapshot.
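
      The read rules in D and E can be sketched as a small visibility filter. This is a minimal in-memory model, not Parquet or Arrow API: the `Row` class, the `visible_rows` function, the integer timestamps, and the offset-to-timestamp dictionary standing in for the footer list are all hypothetical names chosen for illustration. It also assumes that a row deleted exactly at the snapshot time is already invisible, and that a row deleted after the snapshot time is still visible to that snapshot.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Row:
    offset: int      # row position within the file
    insert_ts: int   # per-record insert timestamp (step A)
    value: str


def visible_rows(rows: List[Row], deleted: Dict[int, int],
                 snapshot_ts: Optional[int] = None) -> List[Row]:
    """Return the rows a reader should see.

    `deleted` maps row offset -> delete timestamp and stands in for the
    footer list of step B (hypothetical representation).
    """
    out = []
    for row in rows:
        delete_ts = deleted.get(row.offset)
        if snapshot_ts is None:
            # Step D: current view, skip every logically deleted row.
            if delete_ts is None:
                out.append(row)
        else:
            # Step E: skip rows inserted after the snapshot and rows
            # already deleted at or before it (boundary is an assumption).
            inserted = row.insert_ts <= snapshot_ts
            still_live = delete_ts is None or delete_ts > snapshot_ts
            if inserted and still_live:
                out.append(row)
    return out


rows = [Row(0, 10, "a"), Row(1, 20, "b"), Row(2, 30, "c")]
deleted = {1: 25}  # row 1 was logically deleted at time 25

current = [r.value for r in visible_rows(rows, deleted)]
at_22 = [r.value for r in visible_rows(rows, deleted, snapshot_ts=22)]
at_27 = [r.value for r in visible_rows(rows, deleted, snapshot_ts=27)]
```

      With this data, the current view holds rows "a" and "c", a snapshot at time 22 holds "a" and "b" (row "c" is not yet inserted and row "b" is not yet deleted), and a snapshot at time 27 holds only "a".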

      This way, we would not have to reorganize the columnar data in existing Parquet files; we would only have to modify the metadata footer.
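
      To illustrate the point about leaving column data untouched, here is a rough sketch of the update flow from step C. It is only an in-memory model under stated assumptions: `FileModel` and `update_row` are hypothetical names, the immutable tuple stands in for a file's column data, and the dictionary stands in for the footer's delete list. It does not show how a real Parquet footer would be rewritten on disk.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class FileModel:
    # Column data: an immutable tuple of (insert_ts, value) pairs.
    rows: Tuple[Tuple[int, str], ...]
    # Footer metadata: row offset -> delete timestamp (hypothetical layout).
    deleted: Dict[int, int] = field(default_factory=dict)


def update_row(old: FileModel, offset: int, new_value: str, now: int) -> FileModel:
    """Logically delete `offset` in `old` and return a new file holding the new version."""
    if not 0 <= offset < len(old.rows):
        raise IndexError(f"no row at offset {offset}")
    if offset in old.deleted:
        raise ValueError(f"row {offset} is already logically deleted")
    # Only the footer of the old file changes; its rows are never touched.
    old.deleted[offset] = now
    # The updated record is written to a different file with a fresh insert timestamp.
    return FileModel(rows=((now, new_value),))


f1 = FileModel(rows=((10, "a"), (20, "b")))
rows_before = f1.rows
f2 = update_row(f1, offset=1, new_value="b2", now=25)
```

      After the call, `f1.rows` is the same object as before, `f1.deleted` records offset 1 with delete timestamp 25, and `f2` holds the single new record.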

          People

            Assignee: Unassigned
            Reporter: David Lee (davlee1972@yahoo.com)
            Votes: 0
            Watchers: 2
