  Apache Hudi / HUDI-6786

Integrate FileGroupReader with NewHoodieParquetFileFormat for Spark MOR Snapshot Query


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0-beta1
    • Component/s: None

    Description

      Goal: When `NewHoodieParquetFileFormat` is enabled with `hoodie.datasource.read.use.new.parquet.file.format=true` on Spark, MOR snapshot queries should use `HoodieFileGroupReader`. All relevant tests for basic MOR snapshot queries should pass (except for the caveats in the current `HoodieFileGroupReader`; see the other open tickets around `HoodieFileGroupReader` in this epic).
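
      As a quick illustration, the flag is set as a read option on a Spark MOR snapshot read. This is a minimal sketch assuming an existing Hudi MOR table at `basePath` (a placeholder); `hoodie.datasource.query.type=snapshot` is shown explicitly even though snapshot is the default query type:

      // Minimal sketch: enable the new Parquet file format path for a MOR snapshot read.
      // `basePath` is a placeholder for the location of an existing Hudi MOR table.
      val df = spark.read.format("hudi")
        .option("hoodie.datasource.read.use.new.parquet.file.format", "true")
        .option("hoodie.datasource.query.type", "snapshot") // snapshot is the default query type
        .load(basePath)
      df.show(false)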

      The query logic is implemented in `NewHoodieParquetFileFormat#buildReaderWithPartitionValues`; the following excerpt shows the MOR snapshot query branch:

      else {
        if (logFiles.nonEmpty) {
          val baseFile = createPartitionedFile(InternalRow.empty, hoodieBaseFile.getHadoopPath, 0, hoodieBaseFile.getFileLen)
          buildMergeOnReadIterator(preMergeBaseFileReader(baseFile), logFiles, filePath.getParent, requiredSchemaWithMandatory,
            requiredSchemaWithMandatory, outputSchema, partitionSchema, partitionValues, broadcastedHadoopConf.value.value)
        } else {
          throw new IllegalStateException("should not be here since file slice should not have been broadcasted since it has no log or data files")
          //baseFileReader(baseFile)
        }
      }

      `buildMergeOnReadIterator` should be replaced by `HoodieFileGroupReader`, guarded by a new config `hoodie.read.use.new.file.group.reader`, passing in the correct base file and log file list; a sketch of the intended dispatch follows.
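
      The following is a rough sketch of how the branch could dispatch between the existing iterator and `HoodieFileGroupReader`. The names `useFileGroupReader` (a flag derived from `hoodie.read.use.new.file.group.reader`) and `buildFileGroupReaderIterator` (a helper that would wrap `HoodieFileGroupReader`) are hypothetical and used only for illustration, not the actual implementation:

      // Hypothetical sketch only: useFileGroupReader and buildFileGroupReaderIterator
      // are placeholder names, not existing Hudi APIs.
      else {
        if (logFiles.nonEmpty) {
          if (useFileGroupReader) {
            // New path: hand the base file and log files to HoodieFileGroupReader,
            // which merges them internally and exposes the merged records as an iterator.
            buildFileGroupReaderIterator(hoodieBaseFile, logFiles, requiredSchemaWithMandatory,
              outputSchema, partitionSchema, partitionValues, broadcastedHadoopConf.value.value)
          } else {
            // Existing path: merge base file records with log records via buildMergeOnReadIterator.
            val baseFile = createPartitionedFile(InternalRow.empty, hoodieBaseFile.getHadoopPath, 0, hoodieBaseFile.getFileLen)
            buildMergeOnReadIterator(preMergeBaseFileReader(baseFile), logFiles, filePath.getParent,
              requiredSchemaWithMandatory, requiredSchemaWithMandatory, outputSchema, partitionSchema,
              partitionValues, broadcastedHadoopConf.value.value)
          }
        } else {
          throw new IllegalStateException("should not be here since file slice should not have been broadcasted since it has no log or data files")
        }
      }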

            People

              Assignee: Lin Liu (linliu)
              Reporter: Ethan Guo (guoyihua; old account, please use "yihua")
              Votes: 0
              Watchers: 2
