IMPALA-11171

Impala still re-reads Iceberg manifest files for each SCAN node.


Details


    Description

      In IcebergUtil.getIcebergDataFiles() we issue scan.planFiles():
      https://github.com/apache/impala/blob/7f1ce039be30d5b36a490e8b07728f82f5d4c3de/fe/src/main/java/org/apache/impala/util/IcebergUtil.java#L534

      scan.planFiles() needs to read the manifest files to return a list of files to be scanned. This unfortunately adds significant overhead to the plan time for short-running queries.

      Maybe we can do the following to mitigate this issue:

      • Cache the result of TableScan.planFiles() computed without predicates, and use it instead of pushing predicates down to Iceberg. This would need logic to decide when to use the cached plan files and when to push down predicates.
      • Figure out whether it is possible to cache manifest files so we don't need to re-read them for each table scan.
        • If this is not possible, we might need to contribute code to Iceberg.
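
The first option could be sketched roughly as follows. This is a minimal illustration, not Impala or Iceberg code: PlanFilesCache, getFiles(), and the manifestReader supplier are hypothetical names, the supplier stands in for an unfiltered scan.planFiles() call, and the sketch assumes that filtering the cached listing locally is acceptable. The logic that decides when to push predicates down to Iceberg instead is left out.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;
import java.util.function.Supplier;
import java.util.stream.Collectors;

/**
 * Hypothetical cache of unfiltered plan-file listings.
 * F is whatever type describes a data file (an assumption of this sketch).
 */
final class PlanFilesCache<F> {
  // Keyed by table name and snapshot id: a new snapshot produces a new key,
  // so a listing from an older snapshot is never returned for a newer one.
  private final Map<String, List<F>> cache = new ConcurrentHashMap<>();

  /**
   * Returns the files of the given snapshot that match 'filter'.
   * 'manifestReader' is the expensive step (reading the manifests) and is
   * invoked at most once per (table, snapshot) pair.
   */
  List<F> getFiles(String tableName, long snapshotId,
      Supplier<List<F>> manifestReader, Predicate<F> filter) {
    String key = tableName + "@" + snapshotId;
    List<F> allFiles =
        cache.computeIfAbsent(key, k -> List.copyOf(manifestReader.get()));
    // The predicate is applied to the cached listing, not pushed to the reader.
    return allFiles.stream().filter(filter).collect(Collectors.toList());
  }
}
```

With such a cache, two SCAN nodes over the same table snapshot would trigger a single manifest read, and each would apply its own predicate to the shared listing. The trade-off is that an unfiltered listing can be large for tables with many files, which is one reason the decision logic mentioned above would still be needed.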


          People

            Assignee: Unassigned
            Reporter: Zoltán Borók-Nagy (boroknagyz)
            Votes: 0
            Watchers: 7
