Apache Hudi / HUDI-1879

Spark DataSource tables/HoodieFileIndex issues for Merge On Read

Details

    Description

      The Read as DataSource Tables and HoodieFileIndex implementation that went in with https://github.com/apache/hudi/pull/2283 and https://github.com/apache/hudi/pull/2651 has introduced two major regressions for Merge on Read tables:

      • _ro tables return snapshot results: Because _ro and _rt MOR tables are now queried directly through the Hudi DataSource, the DataSource cannot tell a read optimized table from a real time table, since it has no way to check the table name. In both cases QUERY_TYPE_OPT_KEY defaults to snapshot, so MergeOnReadSnapshotRelation is used for the query and snapshot results are always returned.
      • Partition pruning does not work for real time queries: MergeOnReadSnapshotRelation uses allFiles directly and always fetches every file, without any partition pruning. This is a regression for Spark SQL real time queries, where partition pruning previously worked via the InputFormat, and it will hurt rt query performance.
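
      The two regressions can be illustrated with a small, simplified model. This is a sketch only and not Hudi code: choose_relation and files_to_scan are hypothetical helpers, the file names are made up, and the only real names are the option key hoodie.datasource.query.type (the key behind QUERY_TYPE_OPT_KEY), its snapshot and read_optimized values, and MergeOnReadSnapshotRelation.

```python
# Simplified model of the two regressions described above; not Hudi code.
from typing import Dict, List

QUERY_TYPE_KEY = "hoodie.datasource.query.type"  # key behind QUERY_TYPE_OPT_KEY
SNAPSHOT = "snapshot"
READ_OPTIMIZED = "read_optimized"


def choose_relation(options: Dict[str, str]) -> str:
    """Hypothetical helper: pick a relation from reader options only.

    The table name (_ro or _rt) is not an input, which mirrors the issue.
    """
    query_type = options.get(QUERY_TYPE_KEY, SNAPSHOT)  # falls back to snapshot
    if query_type == READ_OPTIMIZED:
        return "read optimized relation (base files only)"
    return "MergeOnReadSnapshotRelation"


def files_to_scan(all_files: List[str], partition: str, prune: bool) -> List[str]:
    """Hypothetical helper: files a query on one partition would read."""
    if not prune:
        # Regression: every file in the table is listed, whatever the filter.
        return list(all_files)
    return [f for f in all_files if f.startswith(partition + "/")]


# Queries against the _ro and _rt tables both arrive without a query type,
# so both resolve to the snapshot relation.
print(choose_relation({}))
print(choose_relation({QUERY_TYPE_KEY: READ_OPTIMIZED}))

# Made-up file listing for two partitions.
files = ["2021/05/01/a.parquet", "2021/05/01/.a.log.1", "2021/05/02/b.parquet"]
print(files_to_scan(files, "2021/05/01", prune=False))
print(files_to_scan(files, "2021/05/01", prune=True))
```

      In this model a read optimized result is only reachable when the query type is set explicitly, and the unpruned listing grows with the whole table rather than with the partitions the query actually touches.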

            People

              pzw2018 pengzhiwei
              uditme Udit Mehrotra