[HUDI-4176] TableSchemaResolver fetches/parses HoodieCommitMetadata multiple times while extracting Schema - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 0.11.0
Fix Version/s: 0.11.1
Component/s: None
Labels:
- pull-request-available

Story Points: 2
Epic Link: Performance Improvements

Description

We've recently discovered that `TableSchemaResolver` does a lot of throw-away work during initialization and during the basic schema reading performed by the Spark Datasource (see screenshot): it fetches and parses `HoodieCommitMetadata` multiple times while extracting the schema.

This is a problem for large tables, where `HoodieCommitMetadata` can be of non-trivial size (hundreds of MB).

We should minimize the amount of throw-away work done by `TableSchemaResolver` and re-use commit metadata that has already been read and parsed as much as possible.
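
To illustrate the kind of re-use proposed here, below is a minimal sketch of memoizing parsed commit metadata so that repeated schema lookups trigger only one fetch/parse. It is not the actual change from the linked pull request: `ParsedMetadata`, `Lazy`, and `SchemaResolverSketch` are hypothetical stand-ins introduced for illustration, and the real `TableSchemaResolver` and `HoodieCommitMetadata` APIs are not used.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class CachedCommitMetadataSketch {

    /** Hypothetical stand-in for parsed commit metadata (not Hudi's HoodieCommitMetadata). */
    static final class ParsedMetadata {
        final String schemaJson;

        ParsedMetadata(String schemaJson) {
            this.schemaJson = schemaJson;
        }
    }

    /** Runs the wrapped loader at most once and caches its result. */
    static final class Lazy<T> implements Supplier<T> {
        private final Supplier<T> loader;
        private T value;
        private boolean loaded;

        Lazy(Supplier<T> loader) {
            this.loader = loader;
        }

        @Override
        public synchronized T get() {
            if (!loaded) {
                value = loader.get();
                loaded = true;
            }
            return value;
        }
    }

    /** Hypothetical resolver: every accessor shares one cached copy of the metadata. */
    static final class SchemaResolverSketch {
        private final Lazy<ParsedMetadata> latestCommitMetadata;

        SchemaResolverSketch(Supplier<ParsedMetadata> expensiveLoader) {
            this.latestCommitMetadata = new Lazy<>(expensiveLoader);
        }

        String getTableSchema() {
            return latestCommitMetadata.get().schemaJson;
        }

        boolean hasSchema() {
            return !latestCommitMetadata.get().schemaJson.isEmpty();
        }
    }

    public static void main(String[] args) {
        AtomicInteger parses = new AtomicInteger();
        // The loader simulates the expensive fetch + parse and counts invocations.
        SchemaResolverSketch resolver = new SchemaResolverSketch(() -> {
            parses.incrementAndGet();
            return new ParsedMetadata("{\"type\":\"record\"}");
        });

        resolver.hasSchema();
        resolver.getTableSchema();
        resolver.getTableSchema();

        System.out.println("parses = " + parses.get()); // prints "parses = 1"
    }
}
```

The trade-off in this sketch is that the parsed metadata stays in memory for the lifetime of the resolver; for metadata in the hundreds of MB, a real implementation would presumably need to scope that lifetime carefully.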

Attachments

Issue Links

relates to

- HUDI-3626 Refactor TableSchemaResolver to remove `includeMetadataFields` flags (Open)
- HUDI-4178 Performance regressions in Spark DataSourceV2 Integration (Closed)

links to

GitHub Pull Request #5733

People

Assignee: Alexey Kudinkin

Reporter: Alexey Kudinkin

Votes: 0

Watchers: 1

Dates

Created: 01/Jun/22 21:30

Updated: 07/Jun/22 06:53

Resolved: 07/Jun/22 06:53