  1. Apache Hudi
  2. HUDI-8717 Reader feature standardization - Phase 0
  3. HUDI-8654

Support correct merging results with record positions in log blocks generated during pending compaction

Details

    • Sub-task
    • Status: Patch Available
    • Blocker
    • Resolution: Unresolved
    • None
    • 1.0.1
    • None

    Description

      When there is a pending compaction, the new base files to be generated by the compaction are not available during this transaction. Because the log files written in MOR by this transaction can be attached to the base file generated by the compaction in the latest file slice, accurate record positions may not be derivable. However, log files written in later delta commits, after the compaction has completed, have accurate positions.

      Similarly, for NBCC, the compaction can be scheduled during an inflight deltacommit. In this case the log file generated by the inflight deltacommit is associated with the new base file from the compaction, which may have different positions because of deletes.

      We need to make sure that the file group reader with position-based merging generates correct results for such a mix of log blocks.

          Activity

            Y Ethan Guo (yihua) added a comment - edited

            Problem

            When NBCC or async compaction happens, the positions generated when the log blocks were written can be inaccurate for a later snapshot read or compaction, because the file slicing may have changed in the meantime.
             
            The heart of the problem is that positions are generated against the base file available at write time, whereas a later snapshot read or compaction can rebase the log file onto a new base file under completion-time-based file slicing. If the new base file was generated from the old base file plus log files containing deletes, the positions will be wrong, and so will the merging results.
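To make the position drift concrete, here is a minimal sketch under a toy model that is not Hudi code: a base file is a Python list of record keys, and a record's position is its index in that list. All names are hypothetical.

```python
# Toy model (hypothetical, not Hudi code): a base file is a list of record
# keys, and a record's "position" is its index in that list.
old_base = ["k0", "k1", "k2", "k3"]  # stands in for the old base file

# Earlier log files delete k1; compaction applies the delete when it
# writes the new base file.
deleted_keys = {"k1"}
new_base = [k for k in old_base if k not in deleted_keys]

# A log block written while compaction is pending records an update to k2,
# with the position taken from the only base file available: the old one.
update_key = "k2"
position_at_write = old_base.index(update_key)

# After file slicing rebases the log file onto the new base file, the same
# position points at a different record.
record_at_position = new_base[position_at_write]
correct_position = new_base.index(update_key)

print(position_at_write, record_at_position, correct_position)
```

In this model the update meant for `k2` would be applied to `k3`, which is the kind of wrong merging result described above.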
             
            Take the following example. When .fg1_ts6.log is written, compaction ts7 has been requested but has not completed, so the base file used to fetch positions for updates/deletes is fg1_ts1.parquet . After the compaction generates fg1_ts7.parquet and .fg1_ts6.log gets completion time ts8 , fg1_ts7.parquet and .fg1_ts6.log belong to the latest file slice based on completion time, but the positions in .fg1_ts6.log cannot be used for merging against fg1_ts7.parquet . Note that there is no issue with the positions in .fg1_ts2.log and .fg1_ts4.log : the base file attached to their file slice does not change over time, so the positions can still be used for correct merging.

            fg1_ts1.parquet                         (fg1_ts7.parquet)
                                                     from compaction
                          .fg1_ts2.log  .fg1_ts4.log                  .fg1_ts6.log
                 (completion time ts3) (completion time ts5)      (completion time ts8)
                                                      written before fg1_ts7.parquet is generated
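The file slicing in the diagram can be sketched as follows. This is a simplified illustration, not Hudi's implementation: it assumes a log file is attached to the latest base file whose instant time is earlier than the log file's completion time, and it uses plain integers for ts1 to ts8. `LogFile` and `base_for_log` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LogFile:
    instant_time: int     # e.g. 6 for .fg1_ts6.log
    completion_time: int  # e.g. 8 for completion time ts8


def base_for_log(log: LogFile, base_instant_times: List[int]) -> Optional[int]:
    """Simplified rule (an assumption): pick the latest base file whose
    instant time is earlier than the log file's completion time."""
    candidates = [b for b in base_instant_times if b < log.completion_time]
    return max(candidates) if candidates else None


logs = [LogFile(2, 3), LogFile(4, 5), LogFile(6, 8)]

# While compaction ts7 is still pending, only fg1_ts1.parquet exists, so the
# writer of .fg1_ts6.log computes positions against base instant 1.
base_seen_by_writer = 1

# After compaction completes, both base files exist and slicing is redone.
slices = {log.instant_time: base_for_log(log, [1, 7]) for log in logs}
print(slices)
```

Under this rule .fg1_ts2.log and .fg1_ts4.log stay with fg1_ts1.parquet, while .fg1_ts6.log moves to fg1_ts7.parquet even though its positions were computed against fg1_ts1.parquet.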
             

            Proposal

            We should always write positions, and let the merger decide whether positional merging can be used correctly.

            Design Option 1

            Add to the log block header the instant time of the base file against which the positions were generated.
            When merging, if the base instant time recorded for the positions does not match the instant time of the base file being merged, do not use positions for merging the records in this log block. This is simple and straightforward, and it avoids any confusion when the file slicing, particularly the base file, has changed for a log file and block.
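The check in Design Option 1 can be sketched as below. This is a minimal illustration, not the Hudi implementation: `LogBlock`, `positions_base_instant`, and `use_positional_merge` are hypothetical names, and treating a missing header value as "do not use positions" is an added assumption. What the reader does instead of positional merging (for example, merging by record key) is also outside what the comment specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogBlock:
    log_file: str
    # Hypothetical header field: instant time of the base file the
    # positions were generated against; None if the header lacks it.
    positions_base_instant: Optional[str]


def use_positional_merge(block: LogBlock, base_file_instant: str) -> bool:
    """Use positions only if they were generated against the base file
    that this log block is now being merged with."""
    return (block.positions_base_instant is not None
            and block.positions_base_instant == base_file_instant)


# All three log blocks from the example were written while fg1_ts1.parquet
# was the only base file available.
ts2 = LogBlock(".fg1_ts2.log", "ts1")
ts4 = LogBlock(".fg1_ts4.log", "ts1")
ts6 = LogBlock(".fg1_ts6.log", "ts1")

# Older file slice: base file fg1_ts1.parquet with ts2 and ts4 log files.
older_slice = [use_positional_merge(b, "ts1") for b in (ts2, ts4)]

# Latest file slice: base file fg1_ts7.parquet with the ts6 log file.
latest_slice = use_positional_merge(ts6, "ts7")

print(older_slice, latest_slice)
```

The decision is made per log block, so a file slice that mixes blocks with matching and non-matching base instant times can still use positions for the matching ones.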

            Design Option 2

            No new metadata. Rely on the relationship between the base instant time, the log file instant time, and the completion time to determine, on the fly, the base instant time for the positions and whether to use positional merging.
            In this case, we need to determine the base instant time for the positions written in .fg1_ts6.log on the fly. There are two drawbacks:

            • The instant times of the new base file ( fg1_ts7.parquet ) and the log file ( .fg1_ts6.log ) may not indicate the order in which these files are actually written, e.g., fg1_ts7.parquet can still be written before .fg1_ts6.log . So we would need a fairly complex, error-prone condition to determine the base instant time for the positions written in the log block.
            • We need to look up the completion time, potentially reading the LSM timeline, which adds overhead.
            Y Ethan Guo (yihua) added a comment

            We'll go with Design Option 1; the base instant time for the positions is an important piece of metadata.


            People

              yihua Y Ethan Guo
              yihua Y Ethan Guo
              sivabalan narayanan

                Time Tracking

                  Original Estimate: 20h
                  Remaining Estimate: 16h
                  Time Spent: 4h