Apache Hudi / HUDI-3896

Support SchemaPruning optimization for Hudi's own relations


Details

    Description

      After migrating to Hudi's own Relation implementations, we unfortunately lost some of the optimizations that Spark applies exclusively to `HadoopFsRelation`.

      While these optimizations could be implemented equally well for any `FileRelation`, Spark predicates them on the use of `HadoopFsRelation`, which makes them inapplicable to any of Hudi's relations.

      A proper long-term solution would be to fix this in Spark, by either:

      1. Generalizing such optimizations to any `FileRelation`
      2. Making `HadoopFsRelation` extensible (i.e. making it a non-case class)
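
To make the problem and option 1 concrete, here is a minimal, self-contained sketch. It does not use Spark at all: `FileRelation` and `HadoopFsRelation` are simplified stand-ins for the Spark types of the same name, while `HudiLikeRelation`, `optimizeConcreteOnly` and `optimizeAnyFileRelation` are hypothetical names invented for illustration.

```scala
// Toy model only: simplified stand-ins, NOT Spark's or Hudi's real classes.
object RelationMatchingSketch {
  // Stand-in for Spark's FileRelation trait.
  trait FileRelation { def inputFiles: Seq[String] }

  // Stand-in for Spark's HadoopFsRelation (also a case class in Spark).
  final case class HadoopFsRelation(inputFiles: Seq[String]) extends FileRelation

  // Hypothetical stand-in for one of Hudi's own relation implementations.
  final class HudiLikeRelation(val inputFiles: Seq[String]) extends FileRelation

  // Mirrors the situation described above: the rule fires only when the
  // relation is the concrete case class, so other relations are skipped.
  def optimizeConcreteOnly(relation: Any): Boolean = relation match {
    case _: HadoopFsRelation => true
    case _                   => false
  }

  // Option 1: predicate the rule on the trait instead of the concrete class.
  def optimizeAnyFileRelation(relation: Any): Boolean = relation match {
    case _: FileRelation => true
    case _               => false
  }

  def main(args: Array[String]): Unit = {
    val hudi = new HudiLikeRelation(Seq("part-0001.parquet"))
    println(s"concrete-only rule applies to Hudi-like relation: ${optimizeConcreteOnly(hudi)}")
    println(s"trait-based rule applies to Hudi-like relation: ${optimizeAnyFileRelation(hudi)}")
  }
}
```

In this toy model the concrete-only rule silently skips the Hudi-like relation, whereas the trait-based rule treats both relations alike. Real optimizer rules also check other properties (for example the file format), which this sketch leaves out.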

       

      One example of this is Spark's `SchemaPruning` optimization rule (HUDI-3891): Spark 3.2.x is able to reduce the amount of data read via schema pruning (projecting only the columns a query needs), even for nested structs. However, this optimization is predicated on the use of `HadoopFsRelation`.
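
As a rough illustration of what pruning a nested schema means, the sketch below narrows a toy schema down to the field paths a query requests. It is not Spark's implementation: `FieldType`, `LongT`, `StringT`, `StructT` and `prune` are hypothetical names, and the logic is a simplified assumption about how such pruning could work.

```scala
// Toy model only: a simplified nested schema, NOT Spark's StructType API.
object NestedSchemaPruningSketch {
  sealed trait FieldType
  case object LongT extends FieldType
  case object StringT extends FieldType
  final case class StructT(fields: Seq[(String, FieldType)]) extends FieldType

  // Keep only the fields reachable through the requested paths, where each
  // path is a sequence of field names such as Seq("user", "address", "city").
  def prune(schema: StructT, paths: Seq[Seq[String]]): StructT = {
    val kept: Seq[(String, FieldType)] = schema.fields.flatMap { case (name, fieldType) =>
      // Remainders of the requested paths that start at this field.
      val tails = paths.filter(_.headOption.contains(name)).map(_.tail)
      val pruned: Option[(String, FieldType)] =
        if (tails.isEmpty) None // no path touches this field: drop it
        else fieldType match {
          // Every path continues into the struct: recurse and keep only the needed children.
          case nested: StructT if tails.forall(_.nonEmpty) => Some(name -> prune(nested, tails))
          // A leaf, or a struct requested as a whole: keep it unchanged.
          case other => Some(name -> other)
        }
      pruned
    }
    StructT(kept)
  }

  def main(args: Array[String]): Unit = {
    val address = StructT(Seq("city" -> StringT, "zip" -> StringT))
    val user = StructT(Seq("name" -> StringT, "address" -> address))
    val schema = StructT(Seq("id" -> LongT, "user" -> user))
    // A query touching only `id` and `user.address.city`.
    println(prune(schema, Seq(Seq("id"), Seq("user", "address", "city"))))
  }
}
```

With a pruned schema like this, a columnar reader can skip `user.name` and `user.address.zip` entirely, which is the source of the read reduction described above.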


            People

              alexey.kudinkin Alexey Kudinkin