Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
Loading a partitioned table uses multiple threads to list files in different partitions, but the access permission checks run before this on a single thread. IMPALA-7320 optimized this for full table loads (more precisely, when a high percentage of partitions have changed), but in some cases we still do namenode RPCs on a single thread to get the access level:
1. as mentioned above, if only a small subset of partitions has changed
2. if the path has an ACL (access control list), an extra getAclStatus RPC is done after getting the file status, leading to partition_count RPCs on a single thread if ACLs are enabled for all partitions
3. if an error occurs while taking the optimized path
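To make the cost in case 2 concrete, here is a minimal sketch that only counts RPCs. It is not Impala's code: `FakeNameNode`, `checkSerially`, and the partition paths are hypothetical stand-ins, and it assumes the unoptimized path, where each partition costs one getFileStatus call plus one getAclStatus call when the path has an ACL.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of the serial access-level check; not Impala's actual code. */
final class SerialAccessCheckSketch {
  /** Stand-in for the namenode client; counts RPCs instead of issuing them. */
  static final class FakeNameNode {
    final AtomicInteger rpcCount = new AtomicInteger();
    final Set<String> pathsWithAcl;

    FakeNameNode(Set<String> pathsWithAcl) {
      this.pathsWithAcl = pathsWithAcl;
    }

    /** Models a getFileStatus call; returns whether the path has an ACL. */
    boolean getFileStatus(String path) {
      rpcCount.incrementAndGet();
      return pathsWithAcl.contains(path);
    }

    /** Models the extra getAclStatus call made for paths that have an ACL. */
    void getAclStatus(String path) {
      rpcCount.incrementAndGet();
    }
  }

  /** Checks partitions one by one on the calling thread and returns the RPC count. */
  static int checkSerially(FakeNameNode nn, List<String> partitionPaths) {
    for (String path : partitionPaths) {
      boolean hasAcl = nn.getFileStatus(path);
      if (hasAcl) {
        nn.getAclStatus(path);
      }
    }
    return nn.rpcCount.get();
  }

  public static void main(String[] args) {
    List<String> parts = List.of("/t/p=1", "/t/p=2", "/t/p=3");
    // No ACLs: one RPC per partition.
    System.out.println(checkSerially(new FakeNameNode(Set.of()), parts)); // prints 3
    // ACL on every partition: two RPCs per partition, all on one thread.
    System.out.println(checkSerially(new FakeNameNode(Set.copyOf(parts)), parts)); // prints 6
  }
}
```

In this model the number of round trips on the single thread grows linearly with the partition count, and ACLs add one more round trip per partition.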
Case 1 is especially problematic for metastore event processing, as partition events often change only a subset of partitions. Even if all partitions are changed, catalogd may not process the events in one batch (see IMPALA-12463), so the unoptimized path is chosen for several smaller batches.
Apart from the optimization in IMPALA-7320, there is no good reason to do the access level check on a single thread, so both cases 1 and 2 could be made faster by moving the check to the multithreaded stage of table loading.
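A rough sketch of the proposed direction, assuming the per-partition check can be expressed as an independent function of the partition path. Everything here is hypothetical and for illustration only: the `AccessLevel` enum, `checkInParallel`, the `accessLevelLookup` callback (standing in for the getFileStatus/getAclStatus RPCs), and the `numThreads` parameter are not taken from the Impala code base.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical sketch of per-partition access checks on a thread pool. */
final class ParallelAccessCheckSketch {
  /** Illustrative access levels; not Impala's actual type. */
  enum AccessLevel { NONE, READ_ONLY, WRITE_ONLY, READ_WRITE }

  /**
   * Runs accessLevelLookup for every partition path on a fixed-size pool and
   * returns the results keyed by path, in the input order.
   */
  static Map<String, AccessLevel> checkInParallel(
      List<String> partitionPaths,
      Function<String, AccessLevel> accessLevelLookup,
      int numThreads) {
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    try {
      // Submit one task per partition so the RPCs can overlap.
      List<Future<AccessLevel>> futures = new ArrayList<>();
      for (String path : partitionPaths) {
        Callable<AccessLevel> task = () -> accessLevelLookup.apply(path);
        futures.add(pool.submit(task));
      }
      // Collect results in submission order.
      Map<String, AccessLevel> result = new LinkedHashMap<>();
      for (int i = 0; i < partitionPaths.size(); i++) {
        String path = partitionPaths.get(i);
        try {
          result.put(path, futures.get(i).get());
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("interrupted while checking " + path, e);
        } catch (ExecutionException e) {
          throw new IllegalStateException("access check failed for " + path, e.getCause());
        }
      }
      return result;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    List<String> parts = List.of("/t/p=1", "/t/p=2", "/t/p=3");
    Map<String, AccessLevel> levels = checkInParallel(
        parts,
        p -> p.endsWith("2") ? AccessLevel.READ_ONLY : AccessLevel.READ_WRITE,
        2);
    System.out.println(levels);
  }
}
```

In an actual change the check would more likely run inside the existing per-partition file listing tasks rather than on a separate pool, so that it shares the same parallelism limit.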
Note that it is also an open question whether all these access permission checks are really needed; see IMPALA-12472.
An anomaly caused by doing these checks on a single thread is that the effect of the max_hdfs_partitions_parallel_load flag can be ambiguous: while it can significantly speed up loading tables with multiple partitions, if the namenode (or the thread that communicates with the namenode) is contended, then parallel loads get an unfair share of the limited resources, so tables where a large amount of work is done on a single thread can actually get slower.
Issue Links
- relates to
  - IMPALA-12472 Skip permission check when refreshing in event processor (Open)
  - IMPALA-7320 Loading HDFS tables calls getFileStatus on each partition serially (Resolved)