Details
Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Description
To split up the equality-delete read support task, let's first deliver a patch with some initial support. The idea is that Flink (one of the engines that can write equality delete files) apparently writes only a subset of the equality-delete use cases that the Iceberg spec allows.
So as a first step, let's deliver the functionality required to read the EQ-deletes written by Flink. Flink writes EQ-deletes for tables in upsert mode, where a primary key is mandatory. To guarantee the uniqueness of the primary key fields, for each insert (which is in fact an upsert) Flink writes one delete file to remove the previous row with the given PK (even if no such row exists) and then writes data files with the new data.
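The upsert flow described above can be sketched as a small in-memory model. This is only an illustration, not Impala or Flink code: `UpsertTable`, `DataRow` and `EqDelete` are hypothetical names, and the sketch assumes that every upsert is its own commit with its own data sequence number. The one rule taken from the Iceberg spec is that an equality delete applies only to data with a strictly lower data sequence number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRow:
    seq: int      # data sequence number of the commit that wrote the row
    values: dict  # column name -> value


@dataclass(frozen=True)
class EqDelete:
    seq: int   # data sequence number of the commit that wrote the delete
    key: dict  # PK column name -> value to delete


class UpsertTable:
    """Hypothetical model of a table written in upsert mode."""

    def __init__(self, pk_cols):
        self.pk_cols = tuple(pk_cols)
        self.rows = []
        self.deletes = []
        self.next_seq = 1

    def upsert(self, row):
        # Assumption: one commit (one sequence number) per upsert.
        seq = self.next_seq
        self.next_seq += 1
        key = {c: row[c] for c in self.pk_cols}
        # The delete is written unconditionally, even if no earlier row
        # with this PK exists.
        self.deletes.append(EqDelete(seq, key))
        self.rows.append(DataRow(seq, dict(row)))

    def scan(self):
        # A row is dropped if a delete with a strictly higher sequence
        # number matches it on all PK columns.
        live = []
        for r in self.rows:
            deleted = any(
                d.seq > r.seq
                and all(r.values[c] == v for c, v in d.key.items())
                for d in self.deletes
            )
            if not deleted:
                live.append(r.values)
        return live


table = UpsertTable(["id"])
table.upsert({"id": 1, "v": "a"})
table.upsert({"id": 2, "v": "b"})
table.upsert({"id": 1, "v": "c"})  # replaces the first row
result = table.scan()
```

After the three upserts, `result` holds one row per primary key, although three delete entries were written.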
How we can narrow down the functionality to be implemented on the Impala side:
- The set of PK columns is not alterable, so we don't have to handle the case where different EQ-delete files have different equality field ID lists.
- Flink's ALTER TABLE for Iceberg tables doesn't allow partition or schema evolution. We can reject queries on tables with EQ-deletes where partition or schema evolution has happened.
- As EQ-deletes are written for NOT NULL PKs, we could omit the case where there are NULLs in the EQ-delete file. (Update: this seemed easy to solve, so it will be part of this patch.)
- For partitioned tables, Flink requires the partition columns to be part of the PK. As a result, each EQ-delete file contains the partition values too, so there is no need for extra logic to check whether the partition spec ID and the partition values match between the data and delete files.
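The restrictions listed above could translate into a planning-time check along the following lines. This is a rough sketch under stated assumptions, not the actual Impala implementation: `TableMeta`, `EqDeleteFile`, `unsupported_reason` and `keys_match` are hypothetical names, and counting schemas and partition specs is only one possible way to detect evolution. The NULL behaviour follows the Iceberg spec, where a NULL in a delete column matches rows whose value is NULL.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EqDeleteFile:
    path: str
    equality_ids: tuple  # field IDs of the equality columns


@dataclass(frozen=True)
class TableMeta:
    schema_count: int          # assumption: more than 1 means schema evolution
    partition_spec_count: int  # assumption: more than 1 means partition evolution
    eq_delete_files: tuple


def unsupported_reason(meta: TableMeta) -> Optional[str]:
    """Return why the table cannot be read, or None if it can."""
    if not meta.eq_delete_files:
        return None  # no EQ-deletes, so none of the restrictions apply
    if meta.schema_count > 1:
        return "schema evolution on a table with equality deletes"
    if meta.partition_spec_count > 1:
        return "partition evolution on a table with equality deletes"
    id_lists = {frozenset(f.equality_ids) for f in meta.eq_delete_files}
    if len(id_lists) > 1:
        return "equality delete files with different equality field ID lists"
    return None


def keys_match(data_key: tuple, delete_key: tuple) -> bool:
    # None stands in for SQL NULL. Unlike SQL '=', a NULL in the delete
    # key matches a NULL in the data row, which Python's == on None gives.
    return data_key == delete_key


files = (EqDeleteFile("d1", (1, 2)), EqDeleteFile("d2", (1, 2)))
ok = TableMeta(schema_count=1, partition_spec_count=1, eq_delete_files=files)
evolved = TableMeta(schema_count=2, partition_spec_count=1, eq_delete_files=files)
mixed = TableMeta(1, 1, (EqDeleteFile("d1", (1, 2)), EqDeleteFile("d2", (1,))))
```

Here `unsupported_reason(ok)` returns `None`, while `evolved` and `mixed` each yield a rejection message. Because the partition columns are part of the PK, `keys_match` on the PK values is enough and no separate partition comparison appears in the sketch.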