IcebergUtil.getIcebergDataFiles() calls scan.planFiles().
scan.planFiles() has to read the manifest files to return the list of files to be scanned. This unfortunately adds significant overhead to planning time for short-running queries.
Maybe we can do the following to mitigate this issue:
- Cache the result of TableScan.planFiles() called without predicates, and use it instead of pushing predicates down to Iceberg. This would need some logic to decide when to use the cached plan files and when to push down predicates
- Figure out if it is possible to cache manifest files so we don't need to re-read them for each table scan.
- If this is not possible then we might need to contribute code to Iceberg
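
A minimal sketch of the first option, assuming the unfiltered file list is cached per table and snapshot, with predicates applied to the cached list instead of being pushed down. PlanFilesCache, getUnfiltered() and getFiltered() are hypothetical names, not existing Impala or Iceberg APIs. In Impala the loader would presumably wrap an unfiltered scan.planFiles() call and the snapshot id would come from the table's current snapshot. Eviction and the decision of when to push predicates down instead are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Impala or Iceberg.
// F stands in for whatever represents a planned file (e.g. a file scan task).
final class PlanFilesCache<F> {
  // Key is "<table>@<snapshotId>", so a new snapshot never sees stale entries.
  // No eviction here; a real cache would need a size or time bound.
  private final Map<String, List<F>> cache = new ConcurrentHashMap<>();

  // Returns the unfiltered file list, invoking the (expensive) loader only
  // on the first request for a given table and snapshot.
  List<F> getUnfiltered(String tableName, long snapshotId,
                        Supplier<List<F>> loader) {
    String key = tableName + "@" + snapshotId;
    return cache.computeIfAbsent(
        key, k -> Collections.unmodifiableList(new ArrayList<>(loader.get())));
  }

  // Applies the predicate to the cached list instead of pushing it down.
  List<F> getFiltered(String tableName, long snapshotId,
                      Supplier<List<F>> loader, Predicate<F> filter) {
    return getUnfiltered(tableName, snapshotId, loader).stream()
        .filter(filter)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    PlanFilesCache<String> cache = new PlanFilesCache<>();
    // Stand-in for the manifest read; file names are made up.
    Supplier<List<String>> loader =
        () -> Arrays.asList("p=1/a.parquet", "p=2/b.parquet");
    System.out.println(cache.getUnfiltered("db.tbl", 1L, loader));
    System.out.println(
        cache.getFiltered("db.tbl", 1L, loader, f -> f.startsWith("p=2/")));
  }
}
```

Keying on the snapshot id means invalidation comes for free when the table changes. The trade-off is that filtering the cached list locally loses the manifest-level pruning Iceberg does when predicates are pushed down, which may matter for tables with many files.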