Affects Version/s: None
Fix Version/s: None
For AlterTable RecoverPartitions, Impala compares the Hive partition names with the names of the cached partitions in order to drop non-existing partitions and add new ones. However, comparing name strings is not sufficient, because some partitions have different names in Hive and in Impala. This usually happens when the partition directory is created by non-Hive apps.
In the same scenario, REFRESH reloads these partitions twice, because they are added to both "removedPartitions" and "partitionsToLoadFiles". Code snippet:
So they are loaded as new partitions and loaded again as "partitionsToLoadFiles" in these two calls:
Let's say external table my_part_tbl is partitioned by (year int, month int, day int). A user creates the HDFS dir year=2020/month=01/day=01, uploads data to it, and then triggers an AlterTable RecoverPartitions command in Impala. Impala will create partition (year=2020/month=1/day=1) in Hive using this location ".../year=2020/month=01/day=01".
The next time AlterTable RecoverPartitions runs (e.g. after more partition dirs have been created), the partition name list obtained from Hive is [year=2020/month=01/day=01]. However, the name list of cached partitions is [year=2020/month=1/day=1]. Because the two strings differ, Impala drops this partition and loads it again as a new partition.
This hurts the performance of AlterTable RecoverPartitions on partitioned tables when all the partition directories are in this state. Many partitions will be reloaded on every run.
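The mismatch described above can be reproduced with a minimal Java sketch. This is not Impala's actual code: the class PartitionNameDemo and the helpers normalize, missingByString and missingByValue are hypothetical names introduced only for illustration, and the sketch assumes that every partition column is an INT, as in my_part_tbl. It compares the two name lists once as raw strings and once after parsing each value into an integer.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PartitionNameDemo {
    // Hypothetical helper: rewrites "year=2020/month=01/day=01" as
    // "year=2020/month=1/day=1". Assumes every partition column is an INT.
    static String normalize(String partName) {
        List<String> parts = new ArrayList<>();
        for (String kv : partName.split("/")) {
            String[] pair = kv.split("=", 2);
            parts.add(pair[0] + "=" + Integer.parseInt(pair[1]));
        }
        return String.join("/", parts);
    }

    // Naive check: Hive names that have no identical string in the cache.
    static Set<String> missingByString(List<String> hiveNames, Set<String> cachedNames) {
        Set<String> missing = new HashSet<>();
        for (String name : hiveNames) {
            if (!cachedNames.contains(name)) missing.add(name);
        }
        return missing;
    }

    // Value-based check: both sides are normalized before the comparison.
    static Set<String> missingByValue(List<String> hiveNames, Set<String> cachedNames) {
        Set<String> normalizedCache = new HashSet<>();
        for (String name : cachedNames) normalizedCache.add(normalize(name));
        Set<String> missing = new HashSet<>();
        for (String name : hiveNames) {
            if (!normalizedCache.contains(normalize(name))) missing.add(name);
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> hiveNames = List.of("year=2020/month=01/day=01");
        Set<String> cachedNames = Set.of("year=2020/month=1/day=1");
        // String comparison reports the partition as new: [year=2020/month=01/day=01]
        System.out.println(missingByString(hiveNames, cachedNames));
        // Value comparison reports nothing to reload: []
        System.out.println(missingByValue(hiveNames, cachedNames));
    }
}
```

A real fix would also have to handle non-integer partition column types, NULL partition values and the escaping that Hive applies to partition names, none of which this sketch attempts.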
Found the table location is hdfs://localhost:20500/test-warehouse/my_part_tbl
Create and upload data to a partition dir using HDFS CLI:
Let Impala detect these partitions.
Then every time AlterTable RecoverPartitions runs, these 4 partitions are reloaded again. The catalogd logs reflect this:
Running REFRESH on this table shows the partitions being reloaded twice.