  1. Apache Hudi
  2. HUDI-6888

Optimize scanInternalv2 to a single pass


Details

    Description

      The current algorithm takes two passes over the log blocks:

      1. First pass, to collect all the valid blocks along with their block instant times, including each rollback block's target instant time.
      2. Second pass, in reverse order of block instant time, to track the final compacted instant time for each block.
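To make the two passes concrete, here is a minimal sketch on a toy model. It is not the Hudi implementation: `Block`, `Kind`, and `scan` are hypothetical names, each instant is assumed to own exactly one block, instant times are assumed to be fixed-width strings that sort lexicographically, and a compacted block is assumed to carry the list of instant times it replaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TwoPassScanSketch {
    // Hypothetical block kinds for illustration only.
    enum Kind { DATA, COMPACTED, ROLLBACK }

    // Toy block: targets are the rolled-back instants (ROLLBACK)
    // or the instants replaced by a log compaction (COMPACTED).
    static final class Block {
        final Kind kind;
        final String instant;
        final List<String> targets;

        Block(Kind kind, String instant, String... targets) {
            this.kind = kind;
            this.instant = instant;
            this.targets = Arrays.asList(targets);
        }
    }

    // Returns the instant times whose blocks survive, in instant-time order.
    static List<String> scan(List<Block> blocksInFileOrder) {
        // Pass 1: collect valid blocks; a rollback block drops its target instants.
        Map<String, Block> valid = new LinkedHashMap<>();
        for (Block b : blocksInFileOrder) {
            if (b.kind == Kind.ROLLBACK) {
                for (String target : b.targets) {
                    valid.remove(target);
                }
            } else {
                valid.put(b.instant, b);
            }
        }

        // Pass 2: walk instants newest to oldest and record, for every replaced
        // instant, the final compacted instant that supersedes it.
        List<String> instants = new ArrayList<>(valid.keySet());
        Collections.sort(instants);
        Map<String, String> finalCompactedBy = new HashMap<>();
        for (int i = instants.size() - 1; i >= 0; i--) {
            String instant = instants.get(i);
            Block b = valid.get(instant);
            if (b.kind == Kind.COMPACTED) {
                // If this compacted block was itself compacted later, inherit that owner.
                String owner = finalCompactedBy.getOrDefault(instant, instant);
                for (String target : b.targets) {
                    finalCompactedBy.put(target, owner);
                }
            }
        }

        // Only blocks that no surviving compacted block supersedes are kept.
        List<String> result = new ArrayList<>();
        for (String instant : instants) {
            if (!finalCompactedBy.containsKey(instant)) {
                result.add(instant);
            }
        }
        return result;
    }
}
```

In this model, the second pass has to run in reverse so that a chain of log compactions resolves to the latest compacted block before the older blocks are examined.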

      Now that we have removed appending to the same log file for multiple deltacommits, we can probably scan in a single pass by keeping an active list or hash map of block instant times to their corresponding blocks, updating it as we go. This should be tested for:

      1. Out-of-order merged blocks: log compaction is scheduled and, by the time it appends its block, another block has been added by another writer.
      2. Log compaction operation fails, so a rollback is issued for its block. The rollback can be the next block or can come at a later point in time.
      3. Log compaction is executing and, before it commits, compaction starts running on the same file group.
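One possible shape for the single-pass idea, again as a hedged sketch on a toy model rather than a proposed patch: `Block`, `Kind`, and `scan` are hypothetical names, each instant is assumed to own one block, and instant times are assumed to be fixed-width strings. The active map is a `TreeMap` keyed by instant time, so a compacted block that lands late in the file still sorts ahead of a newer block from another writer. Blocks replaced by a log compaction are stashed, so a later rollback of that compaction can restore them. Scenario 3 depends on commit state, which this model does not represent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class SinglePassScanSketch {
    // Hypothetical block kinds for illustration only.
    enum Kind { DATA, COMPACTED, ROLLBACK }

    // Toy block: targets are the rolled-back instants (ROLLBACK)
    // or the instants replaced by a log compaction (COMPACTED).
    static final class Block {
        final Kind kind;
        final String instant;
        final List<String> targets;

        Block(Kind kind, String instant, String... targets) {
            this.kind = kind;
            this.instant = instant;
            this.targets = Arrays.asList(targets);
        }
    }

    // Returns the instant times whose blocks survive, in instant-time order.
    static List<String> scan(List<Block> blocksInFileOrder) {
        TreeMap<String, Block> active = new TreeMap<>();      // instant -> live block
        Map<String, List<Block>> replacedBy = new HashMap<>(); // compaction -> stashed blocks
        Set<String> rolledBack = new HashSet<>();

        for (Block b : blocksInFileOrder) {
            switch (b.kind) {
                case ROLLBACK:
                    for (String target : b.targets) {
                        rolledBack.add(target);
                        active.remove(target);
                        // If the target was a log compaction, bring back what it replaced.
                        List<Block> restored = replacedBy.remove(target);
                        if (restored != null) {
                            for (Block r : restored) {
                                if (!rolledBack.contains(r.instant)) {
                                    active.put(r.instant, r);
                                }
                            }
                        }
                    }
                    break;
                case COMPACTED:
                    if (rolledBack.contains(b.instant)) {
                        break;
                    }
                    // Swap the replaced blocks out of the active map and stash them.
                    List<Block> replaced = new ArrayList<>();
                    for (String target : b.targets) {
                        Block old = active.remove(target);
                        if (old != null) {
                            replaced.add(old);
                        }
                    }
                    replacedBy.put(b.instant, replaced);
                    active.put(b.instant, b);
                    break;
                default:
                    if (!rolledBack.contains(b.instant)) {
                        active.put(b.instant, b);
                    }
            }
        }
        return new ArrayList<>(active.keySet());
    }
}
```

For example, feeding blocks in file order `001`, `002`, `004`, then a compacted block `003` covering `001` and `002`, leaves `003` and `004` active. Appending a rollback that targets `003` restores `001` and `002` alongside `004`. The stash costs extra memory compared with a plain map, which is the trade-off for handling a rollback that arrives several blocks after the compaction it targets.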

            People

              codope Sagar Sumit