Apache Hudi / HUDI-7639

Refactor HoodieFileIndex so that different indexes can be used via optimizer rules


    Description

      Currently, `HoodieFileIndex` is responsible for both partition pruning and file skipping. All indexes are used inside the `lookupCandidateFilesInMetadataTable` method through if-else branches. This is not only hard to maintain as we add more indexes, but also induces a static hierarchy. Instead, we need more flexibility so that we can alter the logical plan based on the availability of indexes. For partition pruning in Spark, we already have the `HoodiePruneFileSourcePartitions` rule, but it is injected during the operator optimization batch and does not modify the resulting `LogicalPlan`. To be fully extensible, we should be able to rewrite the `LogicalPlan`: rules should be injectable after partition pruning, i.e., after the operator optimization batch and before any CBO rules that depend on stats. Spark provides the `injectPreCBORules` API for this, but it is only available from Spark 3.1.0 onwards. A rough sketch of such a rule injection follows below.
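      A minimal sketch of how a plan-rewriting rule could be registered in the pre-CBO batch via Spark's `injectPreCBORules` API (Spark 3.1.0+). The class and rule names here (`HoodieIndexExtension`, `HoodieIndexBasedRewriteRule`) are hypothetical and only illustrate the shape of the change, not the actual implementation:

```scala
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule that rewrites relations backed by HoodieFileIndex
// to use whichever index is available in the metadata table.
case class HoodieIndexBasedRewriteRule(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    // ... match on Hudi relations, pick an applicable index, rewrite the plan ...
    plan
  }
}

class HoodieIndexExtension extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    // Pre-CBO batch: runs after operator optimization (so after partition
    // pruning) and before stats-based CBO rules. Requires Spark >= 3.1.0.
    extensions.injectPreCBORules { spark => HoodieIndexBasedRewriteRule(spark) }
  }
}
```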

      The goal of this ticket is to refactor the index hierarchy and create new rules such that Spark versions < 3.1.0 still go via the old path, while later versions can modify the plan using an appropriate index through a rule injected as a pre-CBO rule. A sketch of the intended index abstraction follows below.
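      A rough sketch of the kind of index abstraction the refactor could introduce, so that each index plugs into a common interface instead of adding another if-else branch inside `lookupCandidateFilesInMetadataTable`. The trait and method names below are hypothetical:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression

// Hypothetical common interface implemented by each index
// (column stats, bloom filters, record-level index, etc.).
trait SparkIndexSupport {
  /** Whether this index exists in the metadata table and applies to the given filters. */
  def isIndexAvailable(filters: Seq[Expression]): Boolean

  /** Returns the names of candidate files that may contain matching records. */
  def computeCandidateFileNames(filters: Seq[Expression]): Set[String]
}

// The file index (or a pre-CBO rule) can then pick the first applicable index:
//   indexes.find(_.isIndexAvailable(filters))
//          .map(_.computeCandidateFileNames(filters))
```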

            People

              Assignee: Unassigned
              Reporter: Sagar Sumit (codope)
