Apache Hudi / HUDI-5442

Fix HiveHoodieTableFileIndex to use lazy listing


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.14.0
    • Component/s: reader-core, trino-presto
    • Labels: None
    • Story Points: 2

    Description

      Currently, HiveHoodieTableFileIndex hard-codes shouldListLazily to false, so it always uses eager listing. This causes the file index to scan all table partitions, regardless of the queryPaths provided (the Trino Hive connector passes in only a single partition).

      public HiveHoodieTableFileIndex(HoodieEngineContext engineContext,
                                      HoodieTableMetaClient metaClient,
                                      TypedProperties configProperties,
                                      HoodieTableQueryType queryType,
                                      List<Path> queryPaths,
                                      Option<String> specifiedQueryInstant,
                                      boolean shouldIncludePendingCommits
      ) {
        super(engineContext,
            metaClient,
            configProperties,
            queryType,
            queryPaths,
            specifiedQueryInstant,
            shouldIncludePendingCommits,
            true,              // shouldValidateInstant
            new NoopCache(),   // file-status cache
            false);            // shouldListLazily: hard-coded, forcing eager listing
      }
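
      A minimal sketch of one possible direction for the fix, assuming the flag is simply exposed as a constructor parameter so that callers such as the Trino/Presto integration can opt into lazy listing. The parameter name shouldUseLazyListing and this overload are illustrative only, not the actual patch:

      public HiveHoodieTableFileIndex(HoodieEngineContext engineContext,
                                      HoodieTableMetaClient metaClient,
                                      TypedProperties configProperties,
                                      HoodieTableQueryType queryType,
                                      List<Path> queryPaths,
                                      Option<String> specifiedQueryInstant,
                                      boolean shouldIncludePendingCommits,
                                      boolean shouldUseLazyListing  // illustrative: let callers opt in
      ) {
        super(engineContext,
            metaClient,
            configProperties,
            queryType,
            queryPaths,
            specifiedQueryInstant,
            shouldIncludePendingCommits,
            true,
            new NoopCache(),
            shouldUseLazyListing);  // previously hard-coded to false
      }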

      After flipping shouldListLazily to true for testing, the following exception is thrown:

      io.trino.spi.TrinoException: Failed to parse partition column values from the partition-path: likely non-encoded slashes being used in partition column's values. You can try to work this around by switching listing mode to eager
          at io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:284)
          at io.trino.plugin.hive.util.ResumableTasks$1.run(ResumableTasks.java:38)
          at io.trino.$gen.Trino_392____20221217_092723_2.run(Unknown Source)
          at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:80)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
          at java.base/java.lang.Thread.run(Thread.java:833)
      Caused by: org.apache.hudi.exception.HoodieException: Failed to parse partition column values from the partition-path: likely non-encoded slashes being used in partition column's values. You can try to work this around by switching listing mode to eager
          at org.apache.hudi.BaseHoodieTableFileIndex.parsePartitionColumnValues(BaseHoodieTableFileIndex.java:317)
          at org.apache.hudi.BaseHoodieTableFileIndex.lambda$listPartitionPaths$6(BaseHoodieTableFileIndex.java:288)
          at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
          at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
          at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
          at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
          at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
          at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
          at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
          at org.apache.hudi.BaseHoodieTableFileIndex.listPartitionPaths(BaseHoodieTableFileIndex.java:291)
          at org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:205)
          at org.apache.hudi.BaseHoodieTableFileIndex.getAllInputFileSlices(BaseHoodieTableFileIndex.java:216)
          at org.apache.hudi.hadoop.HiveHoodieTableFileIndex.listFileSlices(HiveHoodieTableFileIndex.java:71)
          at org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatusForSnapshotMode(HoodieCopyOnWriteTableInputFormat.java:263)
          at org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:158)
          at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
          at org.apache.hudi.hadoop.HoodieParquetInputFormatBase.getSplits(HoodieParquetInputFormatBase.java:68)
          at io.trino.plugin.hive.BackgroundHiveSplitLoader.lambda$loadPartition$2(BackgroundHiveSplitLoader.java:493)
          at io.trino.plugin.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:25)
          at io.trino.plugin.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:97)
          at io.trino.plugin.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:493)
          at io.trino.plugin.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:353)
          at io.trino.plugin.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:277)
          ... 6 more 
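
      For context on why the lazy path fails here: in lazy listing mode the index derives partition column values by parsing the relative partition path, rather than collecting them during an eager full-table listing, and the parse aborts when the path segments do not line up with the partition columns. The snippet below is a hypothetical simplification of that check, not Hudi's actual implementation of BaseHoodieTableFileIndex.parsePartitionColumnValues:

      import java.util.Arrays;

      public class PartitionPathParseDemo {
        // Hypothetical simplification: split the relative partition path and
        // require exactly one segment per partition column.
        static String[] parse(String relativePartitionPath, int expectedColumns) {
          String[] segments = relativePartitionPath.split("/");
          if (segments.length != expectedColumns) {
            // Corresponds to the HoodieException in the stack trace above.
            throw new IllegalStateException(
                "Failed to parse partition column values from the partition-path: "
                    + "likely non-encoded slashes being used in partition column's values.");
          }
          return segments;
        }

        public static void main(String[] args) {
          // Single partition column; a value containing unencoded slashes yields
          // three path segments where one is expected, so the parse fails.
          System.out.println(Arrays.toString(parse("2022-12-17", 1)));  // [2022-12-17]
          parse("2022/12/17", 1);                                       // throws
        }
      }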


      People

        Assignee: guoyihua Ethan Guo (this is the old account; please use "yihua")
        Reporter: guoyihua Ethan Guo (this is the old account; please use "yihua")
        Shiyan Xu
        Votes: 0
        Watchers: 1
