[HUDI-3896] Support SchemaPruning optimization for Hudi's own relations - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.12.0
Component/s: None
Labels:
- performance
- pull-request-available

Story Points: 3
Epic Link: Performance Improvements

Description

After migrating to Hudi's own Relation implementations, we unfortunately lost some of the optimizations that Spark applies exclusively to `HadoopFsRelation`.

While these optimizations could just as well be implemented for any `FileRelation`, Spark predicates them on the use of `HadoopFsRelation`, which makes them inapplicable to all of Hudi's relations.

A proper long-term solution would be to fix this in Spark, in either of two ways:

- Generalizing such optimizations to any `FileRelation`
- Making `HadoopFsRelation` extensible (making it a non-case class)

One example of this is Spark's `SchemaPruning` optimization rule (~~HUDI-3891~~): Spark 3.2.x is able to effectively reduce the amount of data read via schema pruning (projecting only the columns that are actually read), even for nested structs. However, this optimization is predicated on the usage of `HadoopFsRelation` (see the attached screenshot).
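The effect described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: it is plain Python rather than Spark code, and `BuiltinFileRelation`, `CustomFileRelation`, `prune`, and `read_schema` are hypothetical names invented for the illustration. It shows how a rule keyed on one concrete relation type prunes nested fields for that type while leaving every other relation reading its full schema.

```python
import copy
from dataclasses import dataclass


# Toy stand-ins invented for this sketch; these are NOT Spark or Hudi classes.
@dataclass
class BuiltinFileRelation:
    """Plays the role of the single relation type the rule recognizes."""
    schema: dict


@dataclass
class CustomFileRelation:
    """Plays the role of a custom file-based relation."""
    schema: dict


def prune(schema, required_paths):
    """Keep only the (possibly nested) fields named by dotted paths."""
    pruned = {}
    for path in required_paths:
        src, dst = schema, pruned
        parts = path.split(".")
        # Walk down the nested structs, creating parents as needed.
        for name in parts[:-1]:
            src = src[name]
            dst = dst.setdefault(name, {})
        # Copy the requested field so the original schema is never mutated.
        dst[parts[-1]] = copy.deepcopy(src[parts[-1]])
    return pruned


def read_schema(relation, required_paths):
    """Schema the scan would read after the toy pruning rule runs."""
    # The rule matches one concrete type only, mirroring the restriction
    # described in this issue; any other relation is left untouched.
    if isinstance(relation, BuiltinFileRelation):
        return prune(relation.schema, required_paths)
    return relation.schema


schema = {
    "id": "long",
    "user": {
        "name": "string",
        "address": {"city": "string", "zip": "string"},
    },
}
required = ["id", "user.address.city"]

builtin_read = read_schema(BuiltinFileRelation(schema), required)
custom_read = read_schema(CustomFileRelation(schema), required)
```

In this model the query needs only `id` and `user.address.city`, yet the custom relation still reads the entire `user` struct. Dispatching on a shared capability instead of one concrete type would correspond to the first of the long-term options listed above.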

Attachments

Screen Shot 2022-04-16 at 1.46.50 PM.png
16/Apr/22 20:46
70 kB
Alexey Kudinkin

Issue Links

- causes: HUDI-3891 Investigate Hudi vs Raw Parquet table discrepancy (Closed)
- relates to: HUDI-3882 Make sure Hudi Spark relations implementations provide similar file-scanning metrics (Open)
- links to: GitHub Pull Request #5428

People

Assignee: Alexey Kudinkin

Reporter: Alexey Kudinkin

Votes: 0

Watchers: 2

Dates

Created: 16/Apr/22 20:43

Updated: 25/Jul/22 21:28

Resolved: 25/Jul/22 21:28