[IMPALA-3653] Consider using listLocatedStatus() API to get filestatus and blocklocations in one RPC call - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: Impala 2.5.0
Fix Version/s: Impala 2.8.0
Component/s: Catalog
Labels:
- catalog-server

Description

Currently, Impala makes 2-3 RPC calls per file to load HDFS block metadata.
DistributedFileSystem.listLocatedStatus() has been available for a long time. With https://issues.apache.org/jira/browse/HDFS-8887 (already backported to CDH5.5), this API can return the status and block locations of all files under a directory in one RPC call, which can greatly reduce HDFS metadata loading time. For example, for a directory with 200K files, the metadata loading time dropped from 40s to under 15s.

One concern is memory usage. The storage ID is a UUID string and the disk ID is an int32, and this information is needed for each replica. If we simply store the storage ID in the catalog metadata, it will increase both the catalog metadata size and the Thrift object size (which affects catalog topic updates and plan fragments). We should try to do the mapping in the FE so that memory usage does not increase.
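One way to picture the FE-side mapping suggested above is to intern each storage UUID string once and let every replica refer to it by a small integer. The following is only a minimal sketch under that assumption; the class `StorageIdInterner`, its methods, and the sample `DS-...` strings are hypothetical and are not taken from the Impala code base.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: maps storage UUID strings to compact integer IDs. */
final class StorageIdInterner {
  // UUID string -> compact ID, used when loading block metadata.
  private final Map<String, Integer> idByUuid = new HashMap<>();
  // Compact ID -> UUID string, used when the full value is needed again.
  private final List<String> uuidById = new ArrayList<>();

  /** Returns the compact ID for a storage UUID, assigning the next free ID on first use. */
  int intern(String storageUuid) {
    Integer existing = idByUuid.get(storageUuid);
    if (existing != null) {
      return existing;
    }
    int id = uuidById.size();
    idByUuid.put(storageUuid, id);
    uuidById.add(storageUuid);
    return id;
  }

  /** Returns the UUID string that was interned under the given compact ID. */
  String lookup(int id) {
    return uuidById.get(id);
  }

  /** Number of distinct storage UUIDs seen so far. */
  int size() {
    return uuidById.size();
  }

  public static void main(String[] args) {
    StorageIdInterner interner = new StorageIdInterner();
    // Made-up storage IDs; three replicas, two of them on the same storage.
    String[] replicaStorageIds = {"DS-1111", "DS-2222", "DS-1111"};
    for (String uuid : replicaStorageIds) {
      System.out.println(uuid + " -> " + interner.intern(uuid));
    }
    System.out.println("distinct storage IDs: " + interner.size());
  }
}
```

With this shape, each replica carries one int instead of a full UUID string, and the string table is stored once rather than once per replica.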

Also, does it make sense to have a global map for the host/TNetworkAddresses mapping? Currently Impala keeps one map per HdfsTable. With more tables and more nodes, this can still take quite a lot of memory.
If we could use a global map, we could link the storage IDs for each host there as well. All impalads and the catalog would then share the same global key for each host index, and there would be no need to send the full value in Thrift update objects or plan fragments.
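The global map idea can be sketched as a single process-wide registry that all tables share, so a host address is stored once and referenced everywhere by its index. This is an illustration only: `GlobalHostIndex` and its methods are hypothetical names, addresses are modeled as plain `host:port` strings rather than Thrift structs, and the sketch covers one process. Keeping the indexes identical across the catalog and all impalads would need a distribution mechanism that is not shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical process-wide registry: one index per distinct host address. */
final class GlobalHostIndex {
  private static final GlobalHostIndex INSTANCE = new GlobalHostIndex();

  private final Map<String, Integer> indexByAddress = new HashMap<>();
  private final List<String> addressByIndex = new ArrayList<>();

  private GlobalHostIndex() {}

  static GlobalHostIndex get() {
    return INSTANCE;
  }

  /** Returns the index for an address, registering the address on first use. */
  synchronized int getOrAdd(String hostPort) {
    Integer existing = indexByAddress.get(hostPort);
    if (existing != null) {
      return existing;
    }
    int index = addressByIndex.size();
    indexByAddress.put(hostPort, index);
    addressByIndex.add(hostPort);
    return index;
  }

  /** Resolves an index back to the full address. */
  synchronized String resolve(int index) {
    return addressByIndex.get(index);
  }

  public static void main(String[] args) {
    GlobalHostIndex hosts = GlobalHostIndex.get();
    // Two different tables loading metadata that mentions the same host.
    int fromTableA = hosts.getOrAdd("node1.example.com:22000");
    int fromTableB = hosts.getOrAdd("node1.example.com:22000");
    System.out.println("same index for both tables: " + (fromTableA == fromTableB));
    System.out.println("resolved: " + hosts.resolve(fromTableA));
  }
}
```

Compared with one map per HdfsTable, the registry grows with the number of distinct hosts rather than with tables times hosts, and messages can carry the integer index instead of the full address.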

Issue Links

is related to

IMPALA-4172 Switch from using getFileBlockLocations to BlockLocation methods (Potential 50% speedup in metadata loading)

Resolved

People

Assignee: Bharath Vissapragada

Reporter: Juan Yu

Votes: 0

Watchers: 7

Dates

Created: 31/May/16 21:13

Updated: 31/Jan/17 19:03

Resolved: 31/Jan/17 19:03