Details
- Epic
- Status: Open
- Major
- Resolution: Unresolved
- None
1.X Format Changes
Description
This epic tracks changes to the Hudi storage format.
Proposals
A format change is anything that changes any bits related to:
- Timeline: active or archived timeline contents, file names.
- Base Files: file format versions, any changes to any data types, file footers, file names.
- Log Files: Block structure, content, names.
- Metadata Table: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings.
- Table properties: What's written to hoodie.properties.
- Marker files: can be left to the writer implementation.
Change summary:
The following functionality should be supportable by the new format tech specs (at a minimum):
Flexibility:
- [Pending] Ability to mix different types of base files within a single table or even a single file group (e.g. images, JSON, vectors, ...)
- [Pending] Easy integration of metadata for JVM and non-JVM clients (Parquet as MT format, HFile native APIs)
Metafields:
- [Resolved] Should _recordkey get special handling when it is a UUID?
- Semantics of _hoodie_commit_time, with completion time changes.
Additional Info:
- Support encoding of watermarks/event time fields as first-class citizens, for handling late-arriving data.
- [Resolved] Position-based skipping of base files
- [Pending] Additional metadata to avoid more RPCs to scan base file/log blocks.
- [Pending] ML/Column family use-case?
- [Resolved] Support having changeset of columns in each write, other headers
Log:
- [No change needed] Support writing updates as deletes and inserts, instead of logging as update to base file.
- [Pending] CDC format is GA.
Table organization:
- [Pending] Support different logical partitions on the same data
- [Pending] RFC-60/Storage of table spread across buckets/root folders
- [Pending] Decouple table location from timeline, metadata. They can all be in different places
Concurrency/Timeline:
- [Pending] Ability to support general purpose multi-table transactions, esp between data and metadata tables.
- [Pending] Support lockless/non-blocking transactions, where writers don't block each other even in the face of conflicts.
- [Resolved] Support for long lived instants in timeline, break down distinction between active/archived
- [Pending] Support checking of uniqueness constraints, even in the face of two concurrent insert transactions.
- [Pending] Support precise time-travel queries
- [Pending] Support time-travel writes.
- [Pending] Support schema history tracking and aid in the schema evolution implementation.
- [Resolved] TrueTime store/support for instant times
- [Pending] No more separate rollback action; make it a new state.
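As a rough illustration of the rollback item only, the sketch below models rollback as a terminal instant state rather than a separate timeline action. REQUESTED, INFLIGHT and COMPLETED mirror the instant states Hudi uses today; `ROLLED_BACK`, the `ALLOWED` table (including the assumption that a completed instant can still be rolled back) and `transition` are hypothetical names for discussion, not part of any Hudi API or of the proposed spec.

```python
from enum import Enum


class InstantState(Enum):
    REQUESTED = "requested"
    INFLIGHT = "inflight"
    COMPLETED = "completed"
    ROLLED_BACK = "rolled_back"  # hypothetical: replaces a separate rollback action


# Hypothetical transition table: which states an instant may move to next.
ALLOWED = {
    InstantState.REQUESTED: {InstantState.INFLIGHT, InstantState.ROLLED_BACK},
    InstantState.INFLIGHT: {InstantState.COMPLETED, InstantState.ROLLED_BACK},
    InstantState.COMPLETED: {InstantState.ROLLED_BACK},  # assumption, see lead-in
    InstantState.ROLLED_BACK: set(),                     # terminal state
}


def transition(current: InstantState, target: InstantState) -> InstantState:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target
```

With this shape, a reader of the timeline would only need to inspect an instant's final state to know whether its writes are valid, instead of cross-referencing a separate rollback instant.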
Metadata table:
- Encode filegroup ID and commit time along with file metadata
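One possible reading of this item, sketched under assumptions: if each file's metadata row carried the file group ID and commit time explicitly, a reader could pick the latest file per file group without parsing file names. `FileEntry`, its field names, and `latest_per_file_group` are hypothetical and do not reflect the actual metadata table schema; the file names in the usage example are placeholders.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical metadata row for one file; all field names are illustrative."""
    file_name: str
    size_bytes: int
    file_group_id: str  # carried explicitly instead of parsed from the file name
    commit_time: str    # instant time, assumed to sort correctly as a string


def latest_per_file_group(entries: Iterable[FileEntry]) -> Dict[str, FileEntry]:
    """Pick the entry with the greatest commit_time for each file group."""
    latest: Dict[str, FileEntry] = {}
    for entry in entries:
        seen = latest.get(entry.file_group_id)
        if seen is None or entry.commit_time > seen.commit_time:
            latest[entry.file_group_id] = entry
    return latest


# Usage with placeholder values.
rows = [
    FileEntry("a-old.parquet", 100, "fg-a", "20240101000000"),
    FileEntry("a-new.parquet", 120, "fg-a", "20240102000000"),
    FileEntry("b-only.parquet", 80, "fg-b", "20240101000000"),
]
newest = latest_per_file_group(rows)
```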
Table Properties:
- Partitioning information/indexing info
Marker Files:
- Write marker files for logs as well, based on new marker format.