Currently, Impala needs to make 2-3 RPC calls to load HDFS block metadata for each file.
DistributedFileSystem.listLocatedStatus() is available long time ago, with https://issues.apache.org/jira/browse/HDFS-8887 (already backported to CDH5.5) this API can return all files' status and block locations under a directory in one RPC calls. That can greatly reduce HDFS metadata loading time. for example, for a directory with 200K files, the metadata loading time reduced from 40s to under 15s.
One concern is memory usage. StorageID is a UUID string, diskID is int32, this info is needed for each replica. If we just simply store storageID in catalog metadata, it will increase catalog metadata size and thrift object size(Impact catalog topic update, plan fragments). We should try to do mapping at fe to make sure memory usage not increase.
Also, Does it make sense to have a global map for host/TNetworkAddresses mapping? Currently Impala keeps one map per HdfsTable. with more tables and more nodes, this can still take quite some memory.
If we could use a global map, we could link the storageID for each host here as well. and all impalads and catalog will have the same global mapkey for host index, and no need to send the full value in thrift update objects or plan fragments.