Apache Hudi / HUDI-4178

Performance regressions in Spark DataSourceV2 Integration


Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.11.0
    • Fix Version/s: 0.11.1
    • Component/s: None

    Description

      There are multiple issues with our current DataSource V2 integration:

      Because we advertise Hudi tables as V2, Spark expects them to implement certain APIs that are not implemented at the moment; instead, we use a custom resolution rule (in HoodieSpark3Analysis) to manually fall back to the V1 APIs. This poses the following problems:

      1. It doesn't fully implement Spark's protocol: for example, the rule doesn't cache the produced `LogicalPlan`, making Spark re-create Hudi relations from scratch (including a full file-listing of the table) for every query reading the table. Adding caching in that sequence is not an option either, since the V2 APIs manage their cache differently, and to leverage that cache we would have to manage its entire lifecycle (adding, flushing) ourselves.
      2. Additionally, the HoodieSpark3Analysis rule does not pass the table's schema from the Spark catalog to Hudi's relations, making them fetch the schema from storage (either from commit metadata or a data file) every time. Both problems are illustrated in the sketch below.
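
      Below is a minimal, hypothetical Scala sketch of such a V2-to-V1 fallback rule, intended only to show where the two problems arise. The class name FallbackHudiV2ToV1, the provider-property check, and the way the V1 relation is built are illustrative assumptions, not Hudi's actual HoodieSpark3Analysis implementation; only standard Spark 3 Catalyst/DataSource APIs are used.

      ```scala
      import scala.collection.JavaConverters._

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
      import org.apache.spark.sql.catalyst.rules.Rule
      import org.apache.spark.sql.connector.catalog.Table
      import org.apache.spark.sql.execution.datasources.{DataSource, LogicalRelation}
      import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation

      // Hypothetical fallback rule: rewrites a Hudi DataSourceV2Relation into a
      // V1 LogicalRelation, mirroring the shape of the custom resolution rule.
      case class FallbackHudiV2ToV1(spark: SparkSession) extends Rule[LogicalPlan] {

        override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
          case v2 @ DataSourceV2Relation(table, output, _, _, options) if isHudiTable(table) =>
            // Problem 1: this V1 relation is rebuilt from scratch every time the rule
            // fires (including a full file listing of the table), because the produced
            // LogicalPlan is never cached, and reusing Spark's V2 relation cache would
            // require managing its whole lifecycle (adding, flushing).
            val v1Relation = DataSource(
              sparkSession = spark,
              className = "hudi",
              // Problem 2: unless the catalog schema (v2.schema here) is threaded
              // through, the V1 relation re-reads the schema from storage (commit
              // metadata or a data file) on every query.
              userSpecifiedSchema = Some(v2.schema),
              options = options.asScala.toMap
            ).resolveRelation()

            LogicalRelation(v1Relation, output, catalogTable = None, isStreaming = false)
        }

        // Illustrative check only; the real rule matches on Hudi's own V2 Table class.
        private def isHudiTable(table: Table): Boolean =
          table.properties().getOrDefault("provider", "").equalsIgnoreCase("hudi")
      }
      ```

      Note that Spark 3 also ships a built-in V1 fallback path for V2 tables (the V2TableWithV1Fallback interface), which keeps relation caching on Spark's side; leaning on that rather than a custom analysis rule is one possible direction for a fix.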

       

    People

      Assignee: Alexey Kudinkin
      Reporter: Alexey Kudinkin
      Votes: 0
      Watchers: 2
