Apache Hudi / HUDI-2943

Deltastreamer fails to continue after restart with a pending clustering in 0.10.0 when inline clustering is enabled

Details

    Description

      With inline clustering enabled, Deltastreamer fails to restart when a previous run left a pending clustering commit. The upsert fails with a HoodieUpsertException; see the stack trace below.

      Note: as a workaround, running the clustering job with --retry-last-failed-clustering-job works.
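
The workaround can be sketched as a small Python helper that assembles the spark-submit invocation. Only `--retry-last-failed-clustering-job` comes from this report. The main class `org.apache.hudi.utilities.HoodieClusteringJob`, the other flags (`--mode`, `--base-path`, `--table-name`, `--props`), and every path and name below are assumptions that should be checked against the Hudi 0.10.0 utilities bundle in use.

```python
import shlex


def clustering_retry_command(bundle_jar, base_path, table_name, props_file):
    """Build a spark-submit argv for the standalone clustering job (sketch)."""
    return [
        "spark-submit",
        "--class", "org.apache.hudi.utilities.HoodieClusteringJob",  # assumed main class
        bundle_jar,
        "--mode", "scheduleAndExecute",  # assumed flag and value
        "--retry-last-failed-clustering-job",  # flag named in this report
        "--base-path", base_path,  # assumed flag
        "--table-name", table_name,  # assumed flag
        "--props", props_file,  # assumed flag
    ]


# Hypothetical placeholders; none of these values appear in the report.
cmd = clustering_retry_command(
    "hudi-utilities-bundle.jar",
    "s3://example-bucket/hudi/surveys",
    "surveys",
    "clustering.properties",
)
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues.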

      Hudi version: 0.10.0
      Spark version: 3.1.2
      EMR: 6.4.0

      diagnostics: User class threw exception: org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20211206081248919
      at org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:62)
      at org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:46)
      at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:119)
      at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:103)
      at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:159)
      at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:501)
      at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:306)
      at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:193)
      at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
      at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:191)
      at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:511)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735)
      Caused by: org.apache.hudi.exception.HoodieClusteringUpdateException: Not allowed to update the clustering file group HoodieFileGroupId{partitionPath='', fileId='39ca735d-1fc4-40f9-a314-93744642b38c-0'}. For pending clustering operations, we are not going to support update for now.
      at org.apache.hudi.client.clustering.update.strategy.SparkRejectUpdateStrategy.lambda$handleUpdate$0(SparkRejectUpdateStrategy.java:65)
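
The root cause in the trace can be illustrated with a minimal Python sketch of a reject-update check: an upsert that routes records to a file group covered by a pending clustering plan is refused. This is a simplified model based only on the exception message above, not Hudi's actual `SparkRejectUpdateStrategy` code. The function and class names are hypothetical.

```python
from typing import Iterable, NamedTuple, Set


class FileGroupId(NamedTuple):
    partition_path: str
    file_id: str


class ClusteringUpdateError(Exception):
    """Stand-in for HoodieClusteringUpdateException (hypothetical name)."""


def reject_updates_to_pending_clustering(
    pending_clustering: Set[FileGroupId],
    groups_receiving_updates: Iterable[FileGroupId],
) -> None:
    """Raise if any update targets a file group with pending clustering."""
    for group in groups_receiving_updates:
        if group in pending_clustering:
            raise ClusteringUpdateError(
                f"Not allowed to update the clustering file group {group}"
            )


# File group taken from the stack trace; its partition path is empty there.
pending = {FileGroupId("", "39ca735d-1fc4-40f9-a314-93744642b38c-0")}

try:
    # Simulate the restarted upsert touching the same file group.
    reject_updates_to_pending_clustering(pending, list(pending))
    rejected = False
except ClusteringUpdateError:
    rejected = True
```

In this model, once the pending set is empty the same upsert passes the check, which is consistent with the clustering-job workaround noted in the description.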

      Config:

      hoodie.index.type=GLOBAL_SIMPLE
      hoodie.datasource.write.partitionpath.field=
      hoodie.datasource.write.precombine.field=updatedate
      hoodie.datasource.hive_sync.database=datalake
      hoodie.datasource.write.operation=upsert
      hoodie.datasource.hive_sync.table=hudi.prd.surveys
      hoodie.datasource.hive_sync.mode=hms
      hoodie.datasource.hive_sync.enable=false
      hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
      hoodie.datasource.hive_sync.use_jdbc=false
      hoodie.datasource.write.recordkey.field=id
      hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
      hoodie.datasource.write.hive_style_partitioning=true
      hoodie.finalize.write.parallelism=256
      hoodie.deltastreamer.source.dfs.root=s3://datalake-bucket/raw/parquet/data/surveys/year=2021/month=12/day=06/hour=16
      hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
      hoodie.parquet.max.file.size=134217728
      hoodie.parquet.small.file.limit=67108864
      hoodie.parquet.block.size=134217728
      hoodie.parquet.compression.codec=snappy
      hoodie.file.listing.parallelism=256
      hoodie.upsert.shuffle.parallelism=10
      hoodie.metadata.enable=false
      hoodie.metadata.clean.async=true
      hoodie.clustering.preserve.commit.metadata=true
      hoodie.clustering.inline.max.commits=1
      hoodie.clustering.inline=true
      hoodie.clustering.plan.strategy.target.file.max.bytes=134217728
      hoodie.clustering.plan.strategy.small.file.limit=67108864
      hoodie.clustering.plan.strategy.sort.columns=projectid
      hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy
      hoodie.clean.async=true
      hoodie.clean.automatic=true
      hoodie.cleaner.policy=KEEP_LATEST_COMMITS
      hoodie.cleaner.commits.retained=10
      hoodie.deltastreamer.transformer.sql=SELECT id, sid FROM <SRC> a
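
For triage, the settings most relevant to this failure appear to be the upsert operation and the inline clustering keys. The Python sketch below pulls those keys out of a properties-style text. The helper `parse_properties` is hypothetical, and the sample is a four-line excerpt of the config above rather than the full file.

```python
def parse_properties(text):
    """Parse key=value lines, skipping blanks and '#' comments (sketch)."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# Excerpt of the reported config.
sample = """
hoodie.datasource.write.operation=upsert
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=1
hoodie.metadata.enable=false
"""

props = parse_properties(sample)
clustering = {k: v for k, v in props.items() if k.startswith("hoodie.clustering.")}
print(props["hoodie.datasource.write.operation"], clustering)
```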

      Attachments

        1. image-2021-12-08-15-10-02-420.png
          747 kB
          Harsha Teja Kanna

          People

            shivnarayan sivabalan narayanan
            h7kanna Harsha Teja Kanna