A frequent cause of "unknown disk id" warnings during query execution is that, at table loading time, one of the DNs holding relevant data was overloaded and could not respond in time to dfs.getFileBlockStorageLocations() calls from the CatalogServer.
You will find messages similar to this in the catalogd logs at the time of table loading:
Also look for "Unknown disk id count for filesystem" in the catalogd logs to see how many missing disk ids were found in total.
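As a rough aid for that log search, here is a minimal Python sketch that pulls the matching lines out of a catalogd log. The function name and the way the log is read are assumptions for illustration only. The sketch does not parse the count, because the exact message layout is not shown here.

```python
# Substring taken from the description above; everything else is illustrative.
MARKER = "Unknown disk id count for filesystem"


def find_unknown_disk_id_lines(log_lines):
    """Return the lines containing MARKER, in their original order.

    `log_lines` is any iterable of strings, e.g. an open log file.
    Trailing newlines are stripped; the lines are otherwise unchanged.
    """
    return [line.rstrip("\n") for line in log_lines if MARKER in line]
```

To use it, open the catalogd log file (its location depends on the deployment) and pass the file object to find_unknown_disk_id_lines(), then compare the timestamps of the matches with the time the table was loaded.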
This JIRA is for improving the error reporting written to the catalogd log when disk ids fail to load due to DN issues. In particular, the values for the following DN configuration options are often set pretty aggressively.
The logging should include the current setting of these configs and mention that increasing them might mitigate the disk id issues on a busy cluster.
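The shape of such a log message could look like the following sketch. It is written in Python purely for illustration; the real change would live in the catalogd code. The function name, the message wording, and the option names passed in are all hypothetical placeholders, not actual Impala or HDFS identifiers.

```python
def format_disk_id_warning(table_name, missing_count, dn_settings):
    """Build a warning that reports current settings and suggests a mitigation.

    `dn_settings` maps option names to their current values. The caller
    supplies the names; the ones used in examples are placeholders.
    """
    # Sort so the message is stable regardless of dict insertion order.
    settings = ", ".join(f"{name}={value}" for name, value in sorted(dn_settings.items()))
    return (
        f"Failed to load {missing_count} disk ids for table {table_name}. "
        f"Current settings: {settings}. "
        "Increasing these values might mitigate disk id issues on a busy cluster."
    )
```

Including the current values in the message lets an operator judge whether the settings are too aggressive without having to look them up separately.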
In addition, we should consider enhancing the BE "unknown disk id" warning to include possible causes (heavy load on HDFS) and to recommend examining the catalogd logs for more information.
Note that this improvement is only relevant to Impala versions prior to IMPALA-4172 because after that change we no longer contact the DNs for disk ids.