Details
- Type: New Feature
- Status: Closed
- Priority: Blocker
- Resolution: Fixed
Description
Goal: When `NewHoodieParquetFileFormat` is enabled on Spark with `hoodie.datasource.read.use.new.parquet.file.format=true`, the MOR snapshot query should use `HoodieFileGroupReader`. All relevant tests for the basic MOR snapshot query should pass, except where they hit known caveats of the current `HoodieFileGroupReader` (see the other open tickets on `HoodieFileGroupReader` in this epic).
The query logic is implemented in `NewHoodieParquetFileFormat#buildReaderWithPartitionValues`; the following branch handles the MOR snapshot query:
    else {
      if (logFiles.nonEmpty) {
        val baseFile = createPartitionedFile(InternalRow.empty, hoodieBaseFile.getHadoopPath, 0, hoodieBaseFile.getFileLen)
        buildMergeOnReadIterator(preMergeBaseFileReader(baseFile), logFiles, filePath.getParent, requiredSchemaWithMandatory,
          requiredSchemaWithMandatory, outputSchema, partitionSchema, partitionValues, broadcastedHadoopConf.value.value)
      } else {
        throw new IllegalStateException("should not be here since file slice should not have been broadcasted since it has no log or data files")
        // baseFileReader(baseFile)
      }
The call to `buildMergeOnReadIterator` should be replaced by `HoodieFileGroupReader`, controlled by a new config `hoodie.read.use.new.file.group.reader`. The reader must be given the correct base file and list of log files for the file slice.
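As a rough illustration of the proposed switch, the sketch below shows how the new config could select between the two read paths. It is not the Hudi implementation: `FileSliceSketch`, `ReadPath`, `ReaderSelection`, and the default of `false` when the key is absent are all assumptions made for this example. Only the config key comes from this ticket.

```scala
// Hypothetical stand-in for a file slice: an optional base file plus its log files.
// This is NOT a Hudi class; it only carries the paths needed for the example.
final case class FileSliceSketch(baseFile: Option[String], logFiles: Seq[String])

// Hypothetical marker for which read path would be taken.
sealed trait ReadPath
case object FileGroupReaderPath extends ReadPath     // would delegate to HoodieFileGroupReader
case object LegacyMergeIteratorPath extends ReadPath // would call buildMergeOnReadIterator

object ReaderSelection {
  // Config key named in this ticket.
  val UseNewFileGroupReaderKey = "hoodie.read.use.new.file.group.reader"

  // Assumption: the flag is off unless explicitly set to "true".
  def useFileGroupReader(options: Map[String, String]): Boolean =
    options.get(UseNewFileGroupReaderKey).exists(_.trim.equalsIgnoreCase("true"))

  // Picks the read path for one file slice, mirroring the guard in the
  // existing branch: a slice with neither base file nor log files is invalid.
  def selectReadPath(slice: FileSliceSketch, options: Map[String, String]): ReadPath = {
    if (slice.baseFile.isEmpty && slice.logFiles.isEmpty) {
      throw new IllegalStateException("file slice has no log or data files")
    }
    if (useFileGroupReader(options)) FileGroupReaderPath else LegacyMergeIteratorPath
  }
}
```

With this shape, the existing behavior is preserved when the flag is absent, which would let the current tests run against both paths during the migration.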