Apache Hudi / HUDI-4178

Performance regressions in Spark DataSourceV2 Integration


Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.11.0
    • Fix Version/s: 0.11.1
    • Component/s: None

    Description

      There are multiple issues with our current DataSource V2 integration:

      Because we advertise Hudi tables as V2, Spark expects them to implement certain APIs that are not implemented at the moment; instead, we use a custom resolution rule (in HoodieSpark3Analysis) to manually fall back to the V1 APIs. This poses the following problems:

      1. It doesn't fully implement Spark's protocol: for example, the rule doesn't cache the produced `LogicalPlan`, making Spark re-create Hudi relations from scratch (including a full file-listing of the table) for every query reading the table. Adding caching in that sequence is not an option either, since the V2 APIs manage their cache differently, and to leverage that cache we would have to manage its entire lifecycle (adding, flushing) ourselves.
      2. Additionally, the HoodieSpark3Analysis rule does not pass the table's schema from the Spark catalog to Hudi's relations, making them fetch the schema from storage (either from commit metadata or a data file) every time. Both problems are illustrated in the sketch below.
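
      Below is a minimal, hypothetical Scala sketch of such a V2-to-V1 fallback rule, intended only to show where the two problems arise. The class name FallbackHudiV2ToV1, the provider-property check, and the way the V1 relation is built are illustrative assumptions, not Hudi's actual HoodieSpark3Analysis implementation; only standard Spark 3 Catalyst/DataSource APIs are used.

      ```scala
      import scala.collection.JavaConverters._

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
      import org.apache.spark.sql.catalyst.rules.Rule
      import org.apache.spark.sql.connector.catalog.Table
      import org.apache.spark.sql.execution.datasources.{DataSource, LogicalRelation}
      import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation

      // Hypothetical fallback rule: rewrites a Hudi DataSourceV2Relation into a
      // V1 LogicalRelation, mirroring the shape of the custom resolution rule.
      case class FallbackHudiV2ToV1(spark: SparkSession) extends Rule[LogicalPlan] {

        override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
          case v2 @ DataSourceV2Relation(table, output, _, _, options) if isHudiTable(table) =>
            // Problem 1: this V1 relation is rebuilt from scratch every time the rule
            // fires (including a full file listing of the table), because the produced
            // LogicalPlan is never cached, and reusing Spark's V2 relation cache would
            // require managing its whole lifecycle (adding, flushing).
            val v1Relation = DataSource(
              sparkSession = spark,
              className = "hudi",
              // Problem 2: unless the catalog schema (v2.schema here) is threaded
              // through, the V1 relation re-reads the schema from storage (commit
              // metadata or a data file) on every query.
              userSpecifiedSchema = Some(v2.schema),
              options = options.asScala.toMap
            ).resolveRelation()

            LogicalRelation(v1Relation, output, catalogTable = None, isStreaming = false)
        }

        // Illustrative check only; the real rule matches on Hudi's own V2 Table class.
        private def isHudiTable(table: Table): Boolean =
          table.properties().getOrDefault("provider", "").equalsIgnoreCase("hudi")
      }
      ```

      Note that Spark 3 also ships a built-in V1 fallback path for V2 tables (the V2TableWithV1Fallback interface), which keeps relation caching on Spark's side; leaning on that rather than a custom analysis rule is one possible direction for a fix.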

       

    People

      Assignee: Alexey Kudinkin
      Reporter: Alexey Kudinkin
      Votes: 0
      Watchers: 2
