[HUDI-1325] Implement in-memory merging of metadata table with the non-synced part of data timeline - ASF JIRA

Details

Type: Task
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Epic Link: Metadata Table for File Listing & Query Planning

Description

Here is a corner case with syncing a completed compaction from the data timeline to the metadata timeline. Consider the following sequence of events:

t0: Writer schedules compaction at time instant c.
t1: Compactor starts processing c's plan.
t2: Compaction finishes, with c.commit published on the data timeline (not yet synced to the metadata timeline).
t3: In the next round of writing, the writer opens the metadata table, which syncs it and adds the base file produced in c.commit to the metadata table.

Any query running between t2 and t3 cannot rely on metadata, since the new base file will not be present in the metadata table. The timeline will indicate that the compaction completed, and the latest file slice will be computed as simply the logs written to the file groups since compaction. This will lead to incorrect results.
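The stale window can be illustrated with a small sketch. This is a hypothetical model, not Hudi's API: `FileEntry`, `latestFileSlice`, and the file names are made up for illustration, and it assumes the log written after c was scheduled is already present in the metadata table while the base file from c.commit is not.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model (not Hudi's API): each file in a file group is tagged with
// the instant that starts its file slice (its "base instant").
class StaleListingSketch {
    static final class FileEntry {
        final String name;
        final String baseInstant;

        FileEntry(String name, String baseInstant) {
            this.name = name;
            this.baseInstant = baseInstant;
        }
    }

    // Latest file slice = every listed file whose base instant equals the latest
    // completed compaction instant reported by the data timeline.
    static List<String> latestFileSlice(List<FileEntry> listing, String latestCompaction) {
        List<String> slice = new ArrayList<>();
        for (FileEntry f : listing) {
            if (f.baseInstant.equals(latestCompaction)) {
                slice.add(f.name);
            }
        }
        return slice;
    }

    // Listing served by a metadata table that has not yet synced c.commit
    // (assumption: the log written after c was scheduled is already synced).
    static List<FileEntry> metadataListing() {
        List<FileEntry> files = new ArrayList<>();
        files.add(new FileEntry("base_001.parquet", "001"));
        files.add(new FileEntry("log_001.1", "001"));
        files.add(new FileEntry("log_c.1", "c"));
        return files;
    }

    // Listing as it actually exists on storage after t2.
    static List<FileEntry> storageListing() {
        List<FileEntry> files = metadataListing();
        files.add(new FileEntry("base_c.parquet", "c"));
        return files;
    }
}
```

With the metadata listing, the slice for c contains only the log file; with the storage listing, it also contains the base file produced by the compaction. A query in the t2 to t3 window that trusts the metadata listing would therefore read the logs without their base file.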

If we consider the writer alone, we may be okay, since we first sync the metadata table before we do anything for the delta commit at t3. But in general, for queries, we should advise enabling metadata-table-based listings only after all writers, cleaners, and compactors have been enabled to use metadata and have been successfully using it to publish new/deleted files directly to the metadata table. In short, queries cannot rely on the metadata table while the syncing mechanism is the main thing that keeps the data and metadata timelines together.
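The approach named in the issue title, merging the metadata table in memory with the non-synced part of the data timeline, might look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the implementation that resolved this issue: `Instant`, `mergedListing`, and the idea of tracking synced instants as a set of instant times are all hypothetical, and only a single partition is modeled.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch (not Hudi's API) of a read-time, in-memory merge.
class MergedListingSketch {
    // Stand-in for a completed instant on the data timeline, carrying the files
    // it added and deleted in one partition.
    static final class Instant {
        final String time;
        final List<String> addedFiles;
        final List<String> deletedFiles;

        Instant(String time, List<String> addedFiles, List<String> deletedFiles) {
            this.time = time;
            this.addedFiles = addedFiles;
            this.deletedFiles = deletedFiles;
        }
    }

    // Returns the partition listing a reader should use: the metadata table's
    // listing with every completed-but-unsynced instant applied on top, in the
    // order the instants appear on the data timeline. The inputs are not
    // modified and nothing is written back to the metadata table.
    static Set<String> mergedListing(Set<String> metadataFiles,
                                     Set<String> syncedInstantTimes,
                                     List<Instant> completedInstants) {
        Set<String> merged = new TreeSet<>(metadataFiles);
        for (Instant instant : completedInstants) {
            if (syncedInstantTimes.contains(instant.time)) {
                continue; // already reflected in the metadata table
            }
            merged.addAll(instant.addedFiles);
            merged.removeAll(instant.deletedFiles);
        }
        return merged;
    }
}
```

In the scenario above, the metadata table would hold the old base file and the logs, the synced set would exclude c, and the merge would add the base file from c.commit to the listing a query sees, without waiting for the writer's sync at t3.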

People

Assignee: Ryan Pifer

Reporter: Prashant Wason

Votes: 0

Watchers: 2

Dates

Created: 08/Oct/20 00:05

Updated: 04/Jan/22 00:09

Resolved: 29/Dec/20 22:40