Description
For delta-based implementations, it may be beneficial for data sources to represent updates as deletes and inserts. Specifically, doing so can help distribute and order records properly on write. Recall that delete records have only the row ID and metadata attributes set, update records have the data, row ID, and metadata attributes set, and insert records have only the data attributes set.
For instance, consider a data source that is partitioned by bucket(product_id) and relies on a metadata column _row_id (synthetic, internally generated) to identify rows. Splitting updates into deletes and inserts would allow the data source to cluster all update and insert records for the same partition into a single task. Otherwise, the clustering key for updates and inserts will differ, because updates have _row_id set. This is critical for reducing the number of generated files.
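The effect on clustering can be illustrated with a small toy model in plain Python. This is only a sketch, not Spark code: the Record class, the split_update and clustering_key helpers, the modulo-based bucket function, and the bucket count are all hypothetical names and assumptions introduced for illustration. The clustering rule (records carrying _row_id are keyed by it, all others by bucket(product_id)) is likewise an assumption that mirrors the scenario described above.

```python
from dataclasses import dataclass
from typing import Optional

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for this example


@dataclass(frozen=True)
class Record:
    op: str                          # "delete", "update", or "insert"
    data: Optional[dict] = None      # data attributes
    row_id: Optional[int] = None     # the _row_id metadata column
    metadata: Optional[dict] = None  # other metadata attributes


def bucket(product_id: int) -> int:
    # Stand-in for a real bucket transform (assumption: simple modulo).
    return product_id % NUM_BUCKETS


def split_update(rec: Record) -> list:
    # An update becomes a delete (row ID + metadata only)
    # followed by an insert (data only). Other records pass through.
    if rec.op != "update":
        return [rec]
    delete = Record("delete", row_id=rec.row_id, metadata=rec.metadata)
    insert = Record("insert", data=rec.data)
    return [delete, insert]


def clustering_key(rec: Record) -> tuple:
    # Assumed rule: records with _row_id set are clustered by it,
    # everything else by the partition bucket of product_id.
    if rec.row_id is not None:
        return ("row_id", rec.row_id)
    return ("bucket", bucket(rec.data["product_id"]))


# Two records whose product_id values fall into the same bucket.
update = Record("update", data={"product_id": 5}, row_id=100, metadata={})
insert = Record("insert", data={"product_id": 9})

# Without splitting, the two records get different clustering keys.
keys_unsplit = {clustering_key(update), clustering_key(insert)}

# After splitting, the insert half of the update shares the plain insert's key.
split_records = split_update(update) + split_update(insert)
insert_keys = {clustering_key(r) for r in split_records if r.op == "insert"}
```

In this sketch, keys_unsplit contains two distinct keys, while insert_keys collapses to a single bucket key, so the new data from the update and the plain insert could be handled by the same task.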