The fix for IMPALA-4172/IMPALA-3653 introduced a performance regression when loading tables that have many partitions with:
1. inconsistent HDFS path qualification or
2. a custom location (not under the table root dir)
For the first issue, consider a table whose root path is 'hdfs://localhost:8020/warehouse/tbl/'.
A partition with an unqualified location '/warehouse/tbl/p=1' will not be recognized as being a descendant of the table root dir by FileSystemUtil.isDescendentPath() because of how Path.equals() behaves, even if 'hdfs://localhost:8020' is the default filesystem.
Such partitions are incorrectly recognized as having a custom location and are treated specially. The treatment of such partitions is very inefficient:
HdfsTable.loadMetadataAndDiskIds() calls HdfsTable.loadBlockMetadata() once for every location, and each of those calls in turn checks the table's partitions with isDescendentPath().
As a result, we call isDescendentPath() roughly #numLocations * #totalPartitions times, which adds up fast for tables with many partitions.
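The qualification mismatch described above can be reproduced in isolation. The following is a minimal sketch, not Impala code: it uses java.net.URI as a stand-in for Hadoop's Path, and the class name, isDescendant() and qualify() are hypothetical helpers written for illustration. It assumes that equality also compares scheme and authority, which is the behavior the description attributes to Path.equals().

```java
import java.net.URI;

public class PathQualificationSketch {
    // Rebuilds a URI from the scheme/authority of 'base' and the given path.
    static URI withPath(URI base, String path) {
        String prefix = base.getScheme() == null
            ? "" : base.getScheme() + "://" + base.getAuthority();
        return URI.create(prefix + path);
    }

    // Walks up the ancestors of 'child' and compares each one to 'parent'
    // with equals(), which also compares scheme and authority.
    static boolean isDescendant(URI child, URI parent) {
        String path = child.getPath();
        while (true) {
            int idx = path.lastIndexOf('/');
            if (idx <= 0) return false;  // reached the root without a match
            path = path.substring(0, idx);
            if (withPath(child, path).equals(parent)) return true;
        }
    }

    // Qualifies a location that has no scheme against the default filesystem.
    static URI qualify(URI location, URI defaultFs) {
        return location.getScheme() != null
            ? location : withPath(defaultFs, location.getPath());
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://localhost:8020");
        URI tableRoot = URI.create("hdfs://localhost:8020/warehouse/tbl");
        URI partition = URI.create("/warehouse/tbl/p=1");

        // Unqualified location: no ancestor equals the qualified table root.
        System.out.println(isDescendant(partition, tableRoot));
        // After qualification the partition is found under the table root.
        System.out.println(isDescendant(qualify(partition, defaultFs), tableRoot));
    }
}
```

In this sketch the first check fails and the second succeeds, which suggests that qualifying both locations the same way before comparing them would address the first issue.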
There are two issues to fix here:
1. The bug in recognizing partitions under the table root dir (for inconsistent qualification of table/partition locations)
2. The expensive loop for partitions with custom locations (even if legitimately custom)
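For the second issue, one possible direction is to group partitions by location in a single pass, so that handling a location no longer requires scanning every partition. The sketch below is an assumption about how this could look, not the actual fix; the class name, groupByLocation() and the use of plain strings for partition names and locations are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PartitionsByLocationSketch {
    // Builds a location -> partition names index in one pass over the partitions.
    // The input maps a (hypothetical) partition name to its location string.
    static Map<String, List<String>> groupByLocation(Map<String, String> partitionLocations) {
        Map<String, List<String>> byLocation = new HashMap<>();
        for (Map.Entry<String, String> e : partitionLocations.entrySet()) {
            byLocation.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return byLocation;
    }

    public static void main(String[] args) {
        Map<String, String> partitionLocations = new LinkedHashMap<>();
        partitionLocations.put("p=1", "hdfs://localhost:8020/custom/a");
        partitionLocations.put("p=2", "hdfs://localhost:8020/custom/a");
        partitionLocations.put("p=3", "hdfs://localhost:8020/custom/b");

        Map<String, List<String>> byLocation = groupByLocation(partitionLocations);
        // Each location is now resolved with a single map lookup.
        System.out.println(byLocation.get("hdfs://localhost:8020/custom/a"));
    }
}
```

With such an index, the work is roughly proportional to #totalPartitions + #numLocations instead of their product, provided the locations are qualified consistently before being used as keys.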