Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Description
An HdfsPartition in catalogd is a collection of files, and each file is represented by a FileDescriptor. A FileDescriptor (fd) contains:
1. Relative path of the file
2. Length of the file
3. Compression info (e.g. GZIP)
4. Modification time of the file
5. Block info for the file. Each block has an offset, a length, and diskIds
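As a rough illustration of the structure described above, the following Java sketch models a file descriptor and its blocks. It is not Impala's actual FileDescriptor class; the class names, field names, and the string encoding of the compression type are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative model only; NOT Impala's actual FileDescriptor implementation. */
final class FileDescriptorSketch {

  /** One block of a file: its offset, its length, and the disks holding replicas. */
  static final class BlockInfo {
    final long offset;
    final long length;
    final List<Integer> diskIds;

    BlockInfo(long offset, long length, List<Integer> diskIds) {
      this.offset = offset;
      this.length = length;
      this.diskIds = Collections.unmodifiableList(new ArrayList<>(diskIds));
    }
  }

  /** Per-file metadata mirroring the five items in the list above. */
  static final class FileDescriptor {
    final String relativePath;
    final long fileLength;
    final String compression;      // hypothetical encoding, e.g. "GZIP" or "NONE"
    final long modificationTimeMs;
    final List<BlockInfo> blocks;

    FileDescriptor(String relativePath, long fileLength, String compression,
        long modificationTimeMs, List<BlockInfo> blocks) {
      this.relativePath = relativePath;
      this.fileLength = fileLength;
      this.compression = compression;
      this.modificationTimeMs = modificationTimeMs;
      this.blocks = Collections.unmodifiableList(new ArrayList<>(blocks));
    }

    /** Sum of all block lengths; equals fileLength when the blocks cover the file. */
    long totalBlockLength() {
      long total = 0;
      for (BlockInfo b : blocks) total += b.length;
      return total;
    }
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    FileDescriptor fd = new FileDescriptor(
        "part-00000.gz", 200 * mb, "GZIP", 1_660_000_000_000L,
        Arrays.asList(
            new BlockInfo(0, 128 * mb, Arrays.asList(0, 2, 1)),
            new BlockInfo(128 * mb, 72 * mb, Arrays.asList(1, 0, 2))));
    System.out.println(fd.relativePath + " has " + fd.blocks.size() + " blocks");
    System.out.println("blocks cover file: " + (fd.totalBlockLength() == fd.fileLength));
  }
}
```

The block list is the part that requires a filesystem listing to rebuild, which is why reloading it is the expensive step.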
When the event processor processes an AlterPartitionEvent, it currently reloads the whole partition, including its file metadata. Reloading file metadata is relatively expensive because it involves listing files in the underlying filesystem. From the Impala shell, an alter partition is triggered via ALTER TABLE <table_name> PARTITION <partition_spec> <operation>, where the operation can be:
1. Update stats
2. Drop stats
3. Set file format
4. Set row format
5. Set table properties
6. Unset table properties
7. Set serde properties
8. Unset serde properties
9. Set cached <hdfs-pool-name>
10. Unset cached <hdfs-pool-name>
11. Set location
For transactional tables:
If incremental refresh is enabled, the event processor reloads file metadata at the CommitTxn event. Since there is no way to know whether the CommitTxn event was caused by an alter_partition or by some other operation, file metadata reloading cannot be skipped.
For external tables:
Of the operations above, any operation that affects the underlying storage descriptor of a partition should trigger file metadata reloading. These are operations 3, 4, 7, 8 and 11 (set file format, set row format, set serde properties, unset serde properties, and set location).
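The classification above for external tables can be restated as a small lookup. This is only a sketch of the rule as described in this ticket: the enum and method names are hypothetical, and the event processor does not receive the operation type directly (it has to infer a storage descriptor change from the event payload).

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical restatement of the rule for external tables; names are illustrative. */
final class AlterPartitionOps {

  /** The eleven operations from the list above, in the same order. */
  enum Op {
    UPDATE_STATS, DROP_STATS, SET_FILE_FORMAT, SET_ROW_FORMAT,
    SET_TBL_PROPERTIES, UNSET_TBL_PROPERTIES,
    SET_SERDE_PROPERTIES, UNSET_SERDE_PROPERTIES,
    SET_CACHED, UNSET_CACHED, SET_LOCATION
  }

  /** Operations 3, 4, 7, 8 and 11: the ones that touch the storage descriptor. */
  static final Set<Op> STORAGE_DESCRIPTOR_OPS = Collections.unmodifiableSet(EnumSet.of(
      Op.SET_FILE_FORMAT, Op.SET_ROW_FORMAT,
      Op.SET_SERDE_PROPERTIES, Op.UNSET_SERDE_PROPERTIES,
      Op.SET_LOCATION));

  /** True if the operation should trigger a file metadata reload (external tables). */
  static boolean needsFileMetadataReload(Op op) {
    return STORAGE_DESCRIPTOR_OPS.contains(op);
  }

  public static void main(String[] args) {
    for (Op op : Op.values()) {
      System.out.println(op + " -> reload file metadata: " + needsFileMetadataReload(op));
    }
  }
}
```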
How to detect a change in the storage descriptor of a partition:
The HMS partition object received in an alter_partition event contains a metastore.api.StorageDescriptor object. This object has fields like:
- List<FieldSchema> cols
- String location
- String inputFormat
- String outputFormat
- Boolean compressed
- int numBuckets
- SerdeInfo serdeInfo
- List<String> bucketCols
- List<Order> sortCols
- Map<String, String> parameters
Fetch the HMS partition object from the alterPartition event and compare its storage descriptor properties with the corresponding properties of the already cached partition object. If they match, file metadata reloading can be skipped.
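A minimal sketch of that comparison is shown below. To stay self-contained it uses a simplified stand-in class (SdView) instead of the real Thrift-generated StorageDescriptor; the class, the helper methods, and the example warehouse paths are all assumptions for illustration, with SerdeInfo flattened into a library name plus a parameter map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Sketch of a field-by-field storage descriptor comparison; not Impala code. */
final class SdCompareSketch {

  /** Simplified stand-in for the HMS StorageDescriptor (hypothetical class). */
  static final class SdView {
    List<String> cols = new ArrayList<>();
    String location;
    String inputFormat;
    String outputFormat;
    boolean compressed;
    int numBuckets;
    String serdeLib;                                  // flattened SerdeInfo
    Map<String, String> serdeParams = new HashMap<>();
    List<String> bucketCols = new ArrayList<>();
    List<String> sortCols = new ArrayList<>();
    Map<String, String> parameters = new HashMap<>();
  }

  /** True if any storage-related field differs between the cached and event copies. */
  static boolean storageDescriptorChanged(SdView cached, SdView fromEvent) {
    boolean same = Objects.equals(cached.cols, fromEvent.cols)
        && Objects.equals(cached.location, fromEvent.location)
        && Objects.equals(cached.inputFormat, fromEvent.inputFormat)
        && Objects.equals(cached.outputFormat, fromEvent.outputFormat)
        && cached.compressed == fromEvent.compressed
        && cached.numBuckets == fromEvent.numBuckets
        && Objects.equals(cached.serdeLib, fromEvent.serdeLib)
        && Objects.equals(cached.serdeParams, fromEvent.serdeParams)
        && Objects.equals(cached.bucketCols, fromEvent.bucketCols)
        && Objects.equals(cached.sortCols, fromEvent.sortCols)
        && Objects.equals(cached.parameters, fromEvent.parameters);
    return !same;
  }

  /** Builds an example text-format descriptor at the given (made-up) location. */
  static SdView textPartition(String location) {
    SdView sd = new SdView();
    sd.location = location;
    sd.inputFormat = "org.apache.hadoop.mapred.TextInputFormat";
    sd.outputFormat = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
    sd.serdeLib = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe";
    return sd;
  }

  public static void main(String[] args) {
    SdView cached = textPartition("/warehouse/t/p=1");
    SdView statsOnly = textPartition("/warehouse/t/p=1");    // e.g. after update stats
    SdView moved = textPartition("/warehouse/t/p=1_moved");  // e.g. after set location
    System.out.println("stats-only alter changed SD: "
        + storageDescriptorChanged(cached, statsOnly));
    System.out.println("set-location alter changed SD: "
        + storageDescriptorChanged(cached, moved));
  }
}
```

Comparing field by field, instead of relying on a single equality check over the whole object, leaves room to ignore fields that turn out not to affect file metadata.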
Unknowns:
- If a partition is cached in HDFS, should we always reload its file metadata (irrespective of the operations mentioned above) to get the most up-to-date block locations?
cc - vihangk1 stigahuang hsnusonic
Issue Links
- is related to IMPALA-11534 Skip reloading file metadata for some ALTER_TABLE events (Resolved)