[IMPALA-11171] Impala still re-reads Iceberg manifest files for each SCAN node

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: Not Applicable
Component/s: Frontend
Labels:
- impala-iceberg

Description

In IcebergUtil.getIcebergDataFiles() we issue scan.planFiles():
https://github.com/apache/impala/blob/7f1ce039be30d5b36a490e8b07728f82f5d4c3de/fe/src/main/java/org/apache/impala/util/IcebergUtil.java#L534

scan.planFiles() has to read the manifest files to return the list of files to be scanned. Unfortunately, this adds significant overhead to planning time for short-running queries.

Maybe we can do the following to mitigate this issue:

- Cache the result of TableScan.planFiles() issued without predicates, and use the cached file list instead of pushing predicates down to Iceberg. This would need logic to decide when to use the cached plan files and when to push down predicates.
- Figure out whether it is possible to cache manifest files so we don't need to re-read them for each table scan.
  - If this is not possible, we might need to contribute code to Iceberg.
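
A minimal sketch of the first idea, for illustration only. Every name here (PlanFilesCacheSketch, FileEntry, manifestReader, filesToScan) is hypothetical and is not part of Impala or Iceberg. It assumes a snapshot's file list is immutable, so the unfiltered list can be cached per (table, snapshot id) and predicates can be evaluated in memory. The logic for deciding when to fall back to Iceberg predicate pushdown is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch: these types are illustrative stand-ins,
// not Impala or Iceberg classes.
final class PlanFilesCacheSketch {

  // Minimal stand-in for a planned data file.
  static final class FileEntry {
    final String path;
    final int partition;

    FileEntry(String path, int partition) {
      this.path = path;
      this.partition = partition;
    }
  }

  // Unfiltered file lists, keyed by "table@snapshotId".
  private final Map<String, List<FileEntry>> unfilteredFiles = new ConcurrentHashMap<>();

  // Stand-in for the expensive step that reads the manifest files.
  private final Function<String, List<FileEntry>> manifestReader;

  PlanFilesCacheSketch(Function<String, List<FileEntry>> manifestReader) {
    this.manifestReader = manifestReader;
  }

  static String key(String tableName, long snapshotId) {
    return tableName + "@" + snapshotId;
  }

  // Returns the files matching the filter. The manifest reader runs only
  // on the first request for a given table snapshot; later requests filter
  // the cached list in memory.
  List<FileEntry> filesToScan(String tableName, long snapshotId, Predicate<FileEntry> filter) {
    List<FileEntry> all =
        unfilteredFiles.computeIfAbsent(key(tableName, snapshotId), manifestReader);
    List<FileEntry> result = new ArrayList<>();
    for (FileEntry f : all) {
      if (filter.test(f)) {
        result.add(f);
      }
    }
    return result;
  }
}
```

With this shape, two SCAN nodes over the same snapshot would trigger a single manifest read, and a new snapshot id naturally misses the cache. The trade-off is that in-memory filtering gives up Iceberg's manifest-level pruning, which is why the decision logic mentioned above would still be needed for large tables.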

Attachments

Screen Shot 2022-03-12 at 3.23.28 PM.png
12/Mar/22 23:26
108 kB
Riza Suminto

Issue Links

is related to

IMPALA-11658 Implement Iceberg manifest caching configuration for Impala

Resolved

People

Assignee: Unassigned

Reporter: Zoltán Borók-Nagy

Votes: 0

Watchers: 7

Dates

Created: 09/Mar/22 10:55

Updated: 16/Dec/22 20:12

Resolved: 04/Nov/22 12:10