Apache Hudi / HUDI-6242

Format changes for Hudi 1.X release line

Details

    • Epic
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • 1.0.0
    • core
    • 1.X Format Changes

    Description

      This EPIC tracks changes to the Hudi storage format.

      Proposals

      A format change is anything that changes any bits related to:

      • Timeline: active or archived timeline contents, file names.

      • Base Files: file format versions, any changes to any data types, file footers, file names.
      • Log Files: Block structure, content, names.
      • Metadata Table (should we call this the index table instead?): partition names, number of file groups, key/value schema, and metadata-to-MDT row mappings.
      • Table properties: What's written to hoodie.properties.
      • Marker files: can be left to the writer implementation.

      Change summary:

      At a minimum, the new format tech specs should be able to support the following functionality.
      Flexibility:

      • [Pending] Ability to mix different types of base files within a single table or even a single file group (e.g., images, JSON, vectors, ...)
      • [Pending] Easy integration of metadata for JVM and non-JVM clients (Parquet as the metadata table format, HFile native APIs)

      Metafields:

      • [Resolved] Should _recordkey get special handling when it is a UUID?
      • Semantics of _hoodie_commit_time, given the completion time changes.

      Additional Info:

      • Support encoding of watermarks/event time fields as first-class citizens, for handling late-arriving data.
      • [Resolved] Position-based skipping of base file
      • [Pending] Additional metadata to avoid more RPCs to scan base file/log blocks.
      • [Pending] ML/Column family use-case?
      • [Resolved] Support recording the changed set of columns in each write, along with other headers
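
One possible reading of the position-based skipping item above, as a minimal sketch: if a log block records the row positions of the base file records it updates or deletes, a reader can skip those rows by position instead of comparing record keys. The function name, the `log_updates` mapping, and the row layout are all hypothetical and are not taken from the Hudi code base or spec.

```python
from typing import Dict, Iterator, List, Optional

Row = Dict[str, object]


def merge_by_position(
    base_rows: List[Row],
    log_updates: Dict[int, Optional[Row]],
) -> Iterator[Row]:
    """Merge base file rows with log updates keyed by row position.

    log_updates maps a base file row position to its replacement row,
    or to None when the row at that position was deleted.
    (Hypothetical layout, for illustration only.)
    """
    for position, row in enumerate(base_rows):
        if position not in log_updates:
            # Untouched row: emit it as-is, no key comparison needed.
            yield row
            continue
        replacement = log_updates[position]
        if replacement is not None:
            # Updated row: skip the base version and emit the new one.
            yield replacement
        # Deleted row (None): skip it entirely.


base = [{"id": "a", "v": 1}, {"id": "b", "v": 2}, {"id": "c", "v": 3}]
updates = {1: {"id": "b", "v": 20}, 2: None}
merged = list(merge_by_position(base, updates))
```

In this sketch the row at position 1 is replaced and the row at position 2 is dropped, without any record key lookups.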

      Log:

      • [No change needed] Support writing updates as deletes and inserts, instead of logging them as updates to the base file.
      • [Pending] CDC format is GA.
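
To illustrate the first Log item, here is a minimal sketch of encoding an update as a delete followed by an insert. `LogRecord`, `encode_update`, and `apply_records` are hypothetical names chosen for this example; they do not describe Hudi's actual log block structure.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class LogRecord:
    op: str                       # "DELETE" or "INSERT" (hypothetical op codes)
    key: str                      # record key
    payload: Optional[dict] = None


def encode_update(key: str, new_payload: dict) -> List[LogRecord]:
    """Express an update as a delete of the old record plus an insert of the new one."""
    return [LogRecord("DELETE", key), LogRecord("INSERT", key, new_payload)]


def apply_records(state: Dict[str, dict], records: Iterable[LogRecord]) -> Dict[str, dict]:
    """Replay log records in order over a key -> payload snapshot."""
    result = dict(state)
    for record in records:
        if record.op == "DELETE":
            result.pop(record.key, None)
        elif record.op == "INSERT":
            result[record.key] = record.payload
        else:
            raise ValueError(f"unknown op: {record.op}")
    return result


snapshot = apply_records({"k1": {"v": 1}}, encode_update("k1", {"v": 2}))
```

Replaying the two records leaves `k1` with the new payload, the same end state a logged update would produce.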

      Table organization:

      • [Pending] Support different logical partitions on the same data
      • [Pending] RFC-60/Storage of table spread across buckets/root folders
      • [Pending] Decouple table location from timeline and metadata; they can all be in different places

      Concurrency/Timeline:

      • [Pending] Ability to support general-purpose multi-table transactions, especially between data and metadata tables.
      • [Pending] Support lockless/non-blocking transactions, where writers don't block each other even in the face of conflicts.
      • [Resolved] Support long-lived instants in the timeline; break down the distinction between active and archived timelines.
      • [Pending] Support checking of uniqueness constraints, even in the face of two concurrent insert transactions.
      • [Pending] Support precise time-travel queries
      • [Pending] Support time-travel writes.
      • [Pending] Support schema history tracking, and aid the schema evolution implementation.
      • [Resolved] TrueTime store/support for instant times
      • [Pending] No more separate rollback action; make it a new state.
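
Several items above (completion time semantics, precise time-travel queries, rollback as a state) can be pictured with a small sketch. It assumes each instant carries both a requested time and a completion time, and that a rollback is just another state on the instant. Plain integers stand in for instant times, and the `Instant` class, state names, and `visible_as_of` function are hypothetical, not Hudi's timeline API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Instant:
    requested_time: int             # when the writer started
    completion_time: Optional[int]  # None until the instant completes
    state: str                      # "REQUESTED", "INFLIGHT", "COMPLETED", "ROLLED_BACK"


def visible_as_of(timeline: List[Instant], as_of: int) -> List[Instant]:
    """Return instants a time-travel query at `as_of` would see.

    Visibility is decided by completion time rather than requested time,
    so a write that started early but finished late is not exposed too soon.
    """
    return [
        instant
        for instant in timeline
        if instant.state == "COMPLETED"
        and instant.completion_time is not None
        and instant.completion_time <= as_of
    ]


timeline = [
    Instant(100, 130, "COMPLETED"),    # started first, finished last
    Instant(110, 120, "COMPLETED"),
    Instant(115, None, "ROLLED_BACK"),  # rollback as a state, not a separate action
    Instant(125, None, "INFLIGHT"),
]
seen_at_125 = visible_as_of(timeline, 125)
seen_at_130 = visible_as_of(timeline, 130)
```

At time 125 only the instant requested at 110 is visible, even though the instant requested at 100 started earlier; it becomes visible once its completion time of 130 is reached.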

      Metadata table:

      • Encode filegroup ID and commit time along with file metadata
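
As a rough sketch of why this could help: if each file entry in the metadata table carries its file group ID and commit time explicitly, a reader can pick the latest file per file group without parsing file names. The `FileEntry` fields and the `latest_file_per_group` helper are hypothetical, and the sketch assumes commit times are fixed-width timestamp strings that sort lexicographically.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class FileEntry:
    file_name: str       # treated as opaque; never parsed in this sketch
    size_bytes: int
    file_group_id: str   # stored explicitly alongside the file metadata
    commit_time: str     # assumed fixed-width, so string order matches time order


def latest_file_per_group(entries: Iterable[FileEntry]) -> Dict[str, FileEntry]:
    """Pick the entry with the greatest commit time for each file group."""
    latest: Dict[str, FileEntry] = {}
    for entry in entries:
        current = latest.get(entry.file_group_id)
        if current is None or entry.commit_time > current.commit_time:
            latest[entry.file_group_id] = entry
    return latest


entries = [
    FileEntry("x1.parquet", 1024, "fg-1", "20240101000000000"),
    FileEntry("x2.parquet", 2048, "fg-1", "20240102000000000"),
    FileEntry("y1.parquet", 512, "fg-2", "20240101000000000"),
]
latest = latest_file_per_group(entries)
```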

      Table Properties:

      • Partitioning information/indexing info

      Marker Files:

      • Write marker files for logs as well, based on new marker format.

          People

            Assignee: Vinoth Chandar
            Reporter: Vinoth Chandar