IMPALA-13382

OPTIMIZE could be more resistant to concurrent write operations


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component: Catalog

    Description

      When a data file that the OPTIMIZE statement is replacing gets modified concurrently, we can get the following error:

      Cannot commit, found new delete for replaced data file: ...
      

      Because of this we cannot commit OPTIMIZE, so all of its work is lost. Moreover, the newly written data files remain on storage as orphan files.

      To avoid such conflicts, we could do the following before committing OPTIMIZE:

      1. Check if there is partition evolution involved in the file replacements. If so, we can only hope that the data files associated with the new deletes were not selected by OPTIMIZE, and jump straight to "Commit OPTIMIZE" (step 7). Otherwise:
      2. Check if there are new snapshots since the base snapshot of the OPTIMIZE statement
      3. If there are, then iterate over the snapshots
      4. Collect the delete files (possibly via Snapshot.addedDeleteFiles())
      5. Collect the set of partitions associated with delete files
      6. Filter the file replacements by excluding the affected partitions (at this point all files use the current partition spec)
        • We can also remove the newly written data files belonging to the affected partitions
      7. Commit OPTIMIZE

      We need to do this at partition-level granularity because we don't know exactly which new data files replace which old files. If partition evolution is involved, then we have no way of telling which new data files hold the data records coming from the old partitioning.
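
      The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Impala's or Iceberg's actual code: `FileInfo`, `SnapshotInfo`, `Rewrite`, `Plan` and `plan()` are hypothetical stand-ins, partitions are modeled as plain strings, and in a real implementation the delete files would come from the newer Iceberg snapshots (possibly via Snapshot.addedDeleteFiles(), as noted in step 4). The commit itself (step 7) is not shown.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class OptimizeConflictFilter {
    /** Hypothetical stand-in for a data or delete file; partition is a rendered partition key. */
    record FileInfo(String path, int specId, String partition) {}

    /** Hypothetical stand-in for a snapshot committed after the base snapshot of OPTIMIZE. */
    record SnapshotInfo(long id, List<FileInfo> addedDeleteFiles) {}

    /** The replacement OPTIMIZE intends to commit: old files out, new files in. */
    record Rewrite(List<FileInfo> oldFiles, List<FileInfo> newFiles) {}

    /** The reduced rewrite to commit, plus the new files that can be removed from storage. */
    record Plan(Rewrite toCommit, List<FileInfo> removable) {}

    static Plan plan(Rewrite rewrite, List<SnapshotInfo> newerSnapshots, int currentSpecId) {
        // Step 1: with partition evolution we cannot map new files to old ones, so commit as-is.
        boolean evolved = rewrite.oldFiles().stream()
            .anyMatch(f -> f.specId() != currentSpecId);
        // Steps 2-3: nothing to filter if no snapshot was added since the base snapshot.
        if (evolved || newerSnapshots.isEmpty()) {
            return new Plan(rewrite, List.of());
        }
        // Steps 4-5: collect the partitions touched by newly added delete files.
        Set<String> affected = new HashSet<>();
        for (SnapshotInfo snapshot : newerSnapshots) {
            for (FileInfo deleteFile : snapshot.addedDeleteFiles()) {
                affected.add(deleteFile.partition());
            }
        }
        // Step 6: drop the affected partitions from both sides of the replacement.
        List<FileInfo> keptOld = new ArrayList<>();
        for (FileInfo f : rewrite.oldFiles()) {
            if (!affected.contains(f.partition())) keptOld.add(f);
        }
        List<FileInfo> keptNew = new ArrayList<>();
        List<FileInfo> removable = new ArrayList<>();
        for (FileInfo f : rewrite.newFiles()) {
            if (affected.contains(f.partition())) removable.add(f); else keptNew.add(f);
        }
        return new Plan(new Rewrite(keptOld, keptNew), removable);
    }

    public static void main(String[] args) {
        Rewrite rewrite = new Rewrite(
            List.of(new FileInfo("old-a", 1, "p=a"), new FileInfo("old-b", 1, "p=b")),
            List.of(new FileInfo("new-a", 1, "p=a"), new FileInfo("new-b", 1, "p=b")));
        // A concurrent writer added a delete file in partition p=a.
        List<SnapshotInfo> newer =
            List.of(new SnapshotInfo(2L, List.of(new FileInfo("del-a", 1, "p=a"))));
        System.out.println(plan(rewrite, newer, 1));
    }
}
```

      In this sketch the files of partition p=a are left out of the commit and the new file written for p=a is reported as removable, so only the work done for the affected partitions is lost rather than the whole OPTIMIZE.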

            People

              noemi Noemi Pap-Takacs
              boroknagyz Zoltán Borók-Nagy