[HUDI-1879] Spark DataSource tables/HoodieFileIndex issues for Merge On Read - ASF JIRA

The "read as DataSource tables" change and the HoodieFileIndex implementation, which went in via https://github.com/apache/hudi/pull/2283 and https://github.com/apache/hudi/pull/2651, have introduced two major regressions for Merge on Read tables:

_ro tables return snapshot results: Because _ro and _rt MOR tables are now queried directly through the Hudi DataSource, the DataSource cannot tell a read optimized table from a real time table, since it has no way to check the table name. In both cases QUERY_TYPE_OPT_KEY defaults to snapshot, so MergeOnReadSnapshotRelation is used for the query and snapshot results are always returned.
Partition pruning does not work for real time queries: MergeOnReadSnapshotRelation uses allFiles directly and always fetches every file, without doing any partition pruning. This is a regression for Spark SQL real time queries, because partition pruning previously worked for them via the InputFormat. It will therefore hurt the performance of _rt queries.
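
To make the two regressions concrete, here is a minimal, self-contained Python sketch of the behaviour one would expect instead. It is a toy model and not Hudi code, and it does not claim to be how the linked pull request fixes the issue. The option key hoodie.datasource.query.type and the values snapshot and read_optimized are the real ones behind QUERY_TYPE_OPT_KEY; the helper names infer_query_type and prune_file_slices, the FileSlice class, and the sample table and partition names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set

# Real Hudi read option key and values (what QUERY_TYPE_OPT_KEY refers to).
QUERY_TYPE_KEY = "hoodie.datasource.query.type"
SNAPSHOT = "snapshot"
READ_OPTIMIZED = "read_optimized"


def infer_query_type(table_name: str, options: Dict[str, str]) -> str:
    """Hypothetical helper: choose the query type for a MOR table.

    An explicitly set option always wins. Otherwise the _ro suffix of the
    table name selects read_optimized; everything else falls back to
    snapshot. The regression described above corresponds to skipping the
    suffix check, so that _ro tables also end up as snapshot.
    """
    if QUERY_TYPE_KEY in options:
        return options[QUERY_TYPE_KEY]
    if table_name.endswith("_ro"):
        return READ_OPTIMIZED
    return SNAPSHOT


@dataclass(frozen=True)
class FileSlice:
    """Toy stand-in for one base file (plus its logs) in a partition."""
    partition: str
    base_file: str


def prune_file_slices(
    slices: Iterable[FileSlice],
    wanted_partitions: Optional[Set[str]] = None,
) -> List[FileSlice]:
    """Hypothetical helper: keep only slices in the requested partitions.

    Passing None models the regressed behaviour, where every file is
    listed regardless of the partition filter in the query.
    """
    if wanted_partitions is None:
        return list(slices)
    return [s for s in slices if s.partition in wanted_partitions]


# Sample data: three file slices spread over two partitions.
all_slices = [
    FileSlice("2021/05/01", "a.parquet"),
    FileSlice("2021/05/01", "b.parquet"),
    FileSlice("2021/05/02", "c.parquet"),
]

ro_type = infer_query_type("trips_ro", {})
rt_type = infer_query_type("trips_rt", {})
pruned = prune_file_slices(all_slices, {"2021/05/02"})
unpruned = prune_file_slices(all_slices)
```

Until the DataSource can distinguish the two table kinds on its own, a read optimized query can presumably still be requested by setting hoodie.datasource.query.type to read_optimized explicitly on the read, which is the case the first branch of infer_query_type models.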

fixes

HUDI-1491 Support partition pruning for MOR snapshot query

links to

GitHub Pull Request #2925