[HUDI-6786] Integrate FileGroupReader with NewHoodieParquetFileFormat for Spark MOR Snapshot Query - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.0.0-beta1
Component/s: None
Labels:
- pull-request-available

Story Points: 6
Epic Link: 1.X Api & Abstractions

Description

Goal: When `NewHoodieParquetFileFormat` is enabled on Spark with `hoodie.datasource.read.use.new.parquet.file.format=true`, the MOR snapshot query should read file groups through `HoodieFileGroupReader`. All relevant tests for the basic MOR snapshot query should pass, except where they hit known limitations of the current `HoodieFileGroupReader` (see the other open tickets on `HoodieFileGroupReader` in this epic).
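For illustration, a Spark snapshot read with the new file format turned on might look like the sketch below. The `hoodie.datasource.read.use.new.parquet.file.format` key comes from this ticket and `hoodie.datasource.query.type` is the usual Hudi read option; the object name, the method name and `basePath` are placeholders, and a Spark session with the Hudi bundle on the classpath is assumed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MorSnapshotReadSketch {
  // Hypothetical helper: reads the MOR table at `basePath` as a snapshot query
  // with NewHoodieParquetFileFormat enabled.
  def readSnapshot(spark: SparkSession, basePath: String): DataFrame =
    spark.read
      .format("hudi")
      .option("hoodie.datasource.query.type", "snapshot")
      .option("hoodie.datasource.read.use.new.parquet.file.format", "true")
      .load(basePath)
}
```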

The query logic is implemented in `NewHoodieParquetFileFormat#buildReaderWithPartitionValues`; the MOR snapshot query branch currently looks like this:

else {
  if (logFiles.nonEmpty) {
    // Wrap the base file of the slice as a PartitionedFile spanning its full length.
    val baseFile = createPartitionedFile(InternalRow.empty, hoodieBaseFile.getHadoopPath, 0, hoodieBaseFile.getFileLen)
    // Merge the base file records with the log files of the same file slice.
    buildMergeOnReadIterator(preMergeBaseFileReader(baseFile), logFiles, filePath.getParent, requiredSchemaWithMandatory,
      requiredSchemaWithMandatory, outputSchema, partitionSchema, partitionValues, broadcastedHadoopConf.value.value)
  } else {
    throw new IllegalStateException("should not be here since file slice should not have been broadcasted since it has no log or data files")
    //baseFileReader(baseFile)
  }
}

`buildMergeOnReadIterator` should be replaced by `HoodieFileGroupReader`, controlled by a new config `hoodie.read.use.new.file.group.reader`. The reader must be given the correct base file and log file list for the file slice.
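The intended switch can be pictured with a small standalone sketch. It does not use Hudi classes: `FileSliceSketch`, `MergePath` and `MergePathSketch` are hypothetical names made up for this example, and treating an unset flag as disabled is an assumption, not something this ticket states. Only the config key is taken from the description above.

```scala
// Hypothetical stand-in for a file slice: an optional base file plus its log files.
final case class FileSliceSketch(baseFile: Option[String], logFiles: Seq[String])

// The two merge paths discussed in this ticket.
sealed trait MergePath
case object LegacyMergeOnReadIterator extends MergePath
case object FileGroupReaderPath extends MergePath

object MergePathSketch {
  // Config key quoted from the ticket.
  val UseFileGroupReaderKey = "hoodie.read.use.new.file.group.reader"

  // Assumption: the flag counts as disabled when it is not set.
  def useFileGroupReader(options: Map[String, String]): Boolean =
    options.get(UseFileGroupReaderKey).exists(_.trim.equalsIgnoreCase("true"))

  def choose(options: Map[String, String], slice: FileSliceSketch): MergePath = {
    if (slice.logFiles.isEmpty) {
      // Mirrors the excerpt: slices without log files are not expected on this path.
      throw new IllegalStateException("file slice has no log files")
    }
    if (useFileGroupReader(options)) FileGroupReaderPath else LegacyMergeOnReadIterator
  }
}
```

With the flag set to `true`, `MergePathSketch.choose` returns `FileGroupReaderPath` for a slice that has log files; without it, the sketch falls back to `LegacyMergeOnReadIterator`.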

Issue Links

links to

GitHub Pull Request #9819

Sub-Tasks

1. Run TPC-DS benchmark on the integration (Status: Closed, Assignee: Lin Liu)
2. Support various partitioned file slice using HoodieFileGroupReader (Status: Open, Assignee: Lin Liu)

People

Assignee: Lin Liu

Reporter: Ethan Guo (this is the old account; please use "yihua")

Votes: 0

Watchers: 2

Dates

Created: 29/Aug/23 23:00

Updated: 27/Nov/24 14:55

Resolved: 18/Oct/23 21:03