Apache Hudi / HUDI-4752

Add dedup support for MOR tables in CLI

Description

repair dedup is currently supported only for COW tables. We might need to support MOR tables as well.
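For context, the dedup command detects duplicates by grouping rows on the `_hoodie_record_key` metadata column and keeping keys with a count greater than one (this is the Spark SQL query visible in the trace below, which fails for MOR tables because the loaded DataFrame comes back with no columns). A minimal standalone sketch of that duplicate-key check, using hypothetical names rather than the actual DedupeSparkJob API:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of the duplicate-key scan that `repair dedup`
// issues as Spark SQL in DedupeSparkJob, roughly:
//   SELECT `_hoodie_record_key` AS dupe_key, count(*) AS dupe_cnt
//   FROM <partition snapshot> GROUP BY `_hoodie_record_key`
//   HAVING dupe_cnt > 1
public class DedupeSketch {
    // Return record keys that occur more than once, with their counts.
    static Map<String, Long> dupeKeys(List<String> recordKeys) {
        return recordKeys.stream()
            .collect(Collectors.groupingBy(k -> k, Collectors.counting()))
            .entrySet().stream()
            .filter(e -> e.getValue() > 1)  // HAVING dupe_cnt > 1
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        List<String> keys = List.of("uuid-1", "uuid-1", "uuid-2");
        System.out.println(dupeKeys(keys)); // {uuid-1=2}
    }
}
```

For a MOR table, the snapshot fed to this query would have to merge base files with their log files; reading base files alone (as the COW path does) yields the empty-schema DataFrame seen in the error below.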

       

      22/08/30 19:16:07 ERROR SparkMain: Fail to execute commandString
      org.apache.spark.sql.AnalysisException: cannot resolve '`_hoodie_record_key`' given input columns: []; line 5 pos 15;
      'UnresolvedHaving ('dupe_cnt > 1)
      +- 'Aggregate ['_hoodie_record_key], ['_hoodie_record_key AS dupe_key#0, count(1) AS dupe_cnt#1L]
         +- SubqueryAlias `htbl_1661912164735`
            +- LogicalRDD false
          at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
          at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:111)
          at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:108)
          at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:280)
          at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
          at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:279)
          at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:93)
          at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
          at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
          at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:104)
          at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
          at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$2.apply(QueryPlan.scala:121)
          at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
          at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
          at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
          at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
          at scala.collection.AbstractTraversable.map(Traversable.scala:104)
          at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:121)
          at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126)
          at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
          at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:126)
          at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:93)
          at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:108)
          at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:86)
          at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
          at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
          at scala.collection.immutable.List.foreach(List.scala:392)
          at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
          at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:86)
          at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:95)
          at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:108)
          at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
          at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
          at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
          at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:58)
          at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:56)
          at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
          at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:78)
          at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:643)
          at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
          at org.apache.hudi.cli.DedupeSparkJob.getDupeKeyDF(DedupeSparkJob.scala:62)
          at org.apache.hudi.cli.DedupeSparkJob.planDuplicateFix(DedupeSparkJob.scala:88)
          at org.apache.hudi.cli.DedupeSparkJob.fixDuplicates(DedupeSparkJob.scala:195)
          at org.apache.hudi.cli.commands.SparkMain.deduplicatePartitionPath(SparkMain.java:413)
          at org.apache.hudi.cli.commands.SparkMain.main(SparkMain.java:110)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 

       

Reference issue (also contains steps to reproduce):

      https://github.com/apache/hudi/issues/6194


          People

            xichaomin xi chaomin
            shivnarayan sivabalan narayanan
Votes: 1
Watchers: 1
