Details
- Type: Bug
- Status: Closed
- Priority: Blocker
- Resolution: Fixed
- Fix Version/s: 0.11.0
- Component/s: None
Description
There are multiple issues with our current DataSource V2 integration:
Because we advertise Hudi tables as V2, Spark expects them to implement certain APIs which are not implemented at the moment. Instead, we use a custom resolution rule (in HoodieSpark3Analysis) to manually fall back to the V1 APIs. This poses the following problems:
- It doesn't fully implement Spark's protocol: for example, the rule doesn't cache the produced `LogicalPlan`, so Spark re-creates Hudi relations from scratch (including a full file listing of the table) for every query reading the table. Adding caching at that point is not an option, however: the V2 APIs manage their cache differently, so for us to leverage that cache we would have to manage its entire lifecycle (adding, flushing). A minimal sketch of this fallback pattern follows this list.
- Additionally, the HoodieSpark3Analysis rule does not pass the table's schema from the Spark Catalog to Hudi's relations, making them fetch the schema from storage (either from a commit's metadata or from a data file) every time; see the second sketch below.
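To make the first problem concrete, here is a minimal sketch of the V2-to-V1 fallback pattern described above, not Hudi's actual HoodieSpark3Analysis code: an analyzer rule that intercepts the resolved V2 relation and rebuilds it through the V1 path. The rule name `FallbackToV1Rule`, the `isHoodieTable` detection, and the use of the `hudi` short name are assumptions for illustration.

```scala
import scala.collection.JavaConverters._

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.datasources.{DataSource, LogicalRelation}
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation

// Hypothetical stand-in for the rule in HoodieSpark3Analysis.
class FallbackToV1Rule(spark: SparkSession) extends Rule[LogicalPlan] {

  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
    case v2: DataSourceV2Relation if isHoodieTable(v2) =>
      // Rebuild the relation through the V1 code path ("hudi" short name).
      // Nothing produced here is cached: every query that resolves this
      // table runs resolveRelation() again, repeating the file listing.
      val v1Relation = DataSource(
        sparkSession = spark,
        className = "hudi",
        // The schema Spark already holds (v2.schema, from the catalog) is
        // not forwarded, so the relation re-reads it from storage.
        userSpecifiedSchema = None,
        options = v2.options.asScala.toMap
      ).resolveRelation()
      LogicalRelation(v1Relation)
  }

  // Hypothetical detection; a real rule would match Hudi's own table class.
  private def isHoodieTable(v2: DataSourceV2Relation): Boolean =
    v2.table.getClass.getName.toLowerCase.contains("hoodie")
}
```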
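And a sketch of the second problem's fix direction, under the same assumptions and imports as above: forwarding the schema Spark already resolved from the catalog, rather than `None`, spares the V1 relation a round trip to storage just to rediscover it. `toV1RelationWithCatalogSchema` is a hypothetical helper that could live inside the rule above.

```scala
// Hypothetical helper: build the V1 relation with the catalog's schema
// forwarded, so it is not re-read from a commit's metadata or a data file.
private def toV1RelationWithCatalogSchema(
    spark: SparkSession,
    v2: DataSourceV2Relation): LogicalRelation = {
  val relation = DataSource(
    sparkSession = spark,
    className = "hudi",
    userSpecifiedSchema = Some(v2.schema), // catalog schema, no storage read
    options = v2.options.asScala.toMap
  ).resolveRelation()
  LogicalRelation(relation)
}
```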
Issue Links
- is related to: HUDI-4176 TableSchemaResolver fetches/parses HoodieCommitMetadata multiple times while extracting Schema (Closed)
- links to