Apache Hudi / HUDI-7107

Reused MetricsReporter fails to publish metrics in Spark streaming job


Details

    Description

      A customer runs an AWS Glue 4.0 streaming job (based on Spark 3.3.0) using the Apache Hudi 0.14.0 libraries, with the Hudi CloudWatch metrics reporter enabled. The first micro-batch published metrics successfully, but every subsequent batch failed to publish, so only a single data point made it to CloudWatch metrics.
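
      For reference, the reporter is enabled with write configuration roughly like the following (illustrative values; I'm not showing the customer's exact settings here):

        hoodie.metrics.on=true
        hoodie.metrics.reporter.type=CLOUDWATCH

      The error stack trace from the failing batches is as follows: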

      2023-11-09 15:59:17,775 ERROR [stream execution thread for [id = d31c62e2-e697-40b7-b6da-854cf9a8cb14, runId = 0f2325c6-83f4-4dfa-a849-3b3189423a9b]] cloudwatch.CloudWatchReporter (CloudWatchReporter.java:report(236)): Error reporting metrics to CloudWatch. The data in this CloudWatch request may have been discarded, and not made it to CloudWatch.
      java.util.concurrent.ExecutionException: org.apache.hudi.software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: event executor terminated
          at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_382]
          at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) ~[?:1.8.0_382]
          at org.apache.hudi.aws.cloudwatch.CloudWatchReporter.report(CloudWatchReporter.java:234) ~[hudi-aws-bundle-0.14.0.jar:0.14.0]
          at org.apache.hudi.aws.cloudwatch.CloudWatchReporter.report(CloudWatchReporter.java:211) ~[hudi-aws-bundle-0.14.0.jar:0.14.0]
          at org.apache.hudi.com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:237) ~[hudi-aws-bundle-0.14.0.jar:0.14.0]
          at org.apache.hudi.metrics.cloudwatch.CloudWatchMetricsReporter.report(CloudWatchMetricsReporter.java:71) ~[hudi-utilities-bundle_2.12-0.14.0.jar:0.14.0]
          at java.util.ArrayList.forEach(ArrayList.java:1259) ~[?:1.8.0_382]
          at org.apache.hudi.metrics.Metrics.shutdown(Metrics.java:116) ~[hudi-utilities-bundle_2.12-0.14.0.jar:0.14.0]
          at java.util.HashMap$Values.forEach(HashMap.java:982) ~[?:1.8.0_382]
          at org.apache.hudi.metrics.Metrics.shutdownAllMetrics(Metrics.java:88) ~[hudi-utilities-bundle_2.12-0.14.0.jar:0.14.0]
          at org.apache.hudi.HoodieSparkSqlWriter$.cleanup(HoodieSparkSqlWriter.scala:937) ~[hudi-utilities-bundle_2.12-0.14.0.jar:0.14.0]
          at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:151) ~[hudi-utilities-bundle_2.12-0.14.0.jar:0.14.0] 

      This error originates in the AWS SDK for Java v2 (shaded into the Hudi AWS bundle); the root cause is:

      Caused by: java.util.concurrent.RejectedExecutionException: event executor terminated
       at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:923) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
       at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:350) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
       at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:343) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
       at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:825) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
       at io.netty.util.concurrent.SingleThreadEventExecutor.lazyExecute(SingleThreadEventExecutor.java:820) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
       at io.netty.util.concurrent.AbstractScheduledEventExecutor.schedule(AbstractScheduledEventExecutor.java:263) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
       at io.netty.util.concurrent.AbstractScheduledEventExecutor.schedule(AbstractScheduledEventExecutor.java:177) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
       at io.netty.util.concurrent.AbstractEventExecutorGroup.schedule(AbstractEventExecutorGroup.java:50) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
       at org.apache.hudi.software.amazon.awssdk.http.nio.netty.internal.DelegatingEventLoopGroup.schedule(DelegatingEventLoopGroup.java:153) ~[hudi-aws-bundle-0.14.0.jar:0.14.0]
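
      For context, this is consistent with reusing an SDK v2 async client after it has been closed: once the underlying Netty event loop group is terminated, further requests are rejected. The standalone sketch below (not Hudi code; it uses the unshaded SDK packages, made-up namespace and metric names, and assumes AWS credentials and a region are available in the environment; the exact failure mode can vary by SDK version) is expected to surface the same "event executor terminated" error by reusing a closed client:

        import java.util.concurrent.CompletableFuture;

        import software.amazon.awssdk.services.cloudwatch.CloudWatchAsyncClient;
        import software.amazon.awssdk.services.cloudwatch.model.MetricDatum;
        import software.amazon.awssdk.services.cloudwatch.model.PutMetricDataRequest;
        import software.amazon.awssdk.services.cloudwatch.model.PutMetricDataResponse;

        public class ClosedClientRepro {
          public static void main(String[] args) throws Exception {
            CloudWatchAsyncClient client = CloudWatchAsyncClient.create();

            PutMetricDataRequest request = PutMetricDataRequest.builder()
                .namespace("HudiReproTest")                // illustrative namespace
                .metricData(MetricDatum.builder()
                    .metricName("totalRecordsWritten")     // illustrative metric name
                    .value(1.0)
                    .build())
                .build();

            // Works while the client (and its Netty event loop group) is alive,
            // analogous to the first micro-batch.
            CompletableFuture<PutMetricDataResponse> first = client.putMetricData(request);
            first.get();

            // Closing the client terminates the Netty event executors, analogous to
            // the reporter being shut down and its underlying async client closed.
            client.close();

            // Reusing the closed client is expected to fail the same way the reused
            // reporter does: the future completes exceptionally with SdkClientException
            // "Unable to execute HTTP request: event executor terminated".
            CompletableFuture<PutMetricDataResponse> second = client.putMetricData(request);
            second.get();
          }
        }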
      

      I've observed that the MetricsReporter is shut down after the first batch; however, the same MetricsReporter instance is reused in the subsequent batches, where it fails to report metrics.

      Given that MetricsReporter is not implemented to be reusable after shutdown, we should avoid reusing the same instance across batches.
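
      A simplified illustration of the hazard, using hypothetical class and method names rather than Hudi's actual internals: a reporter is cached per table base path and kept across micro-batches; shutdown stops the reporter (closing its client) but does not evict or recreate the cached instance, so the next batch gets back an object that can no longer report.

        import java.util.Map;
        import java.util.concurrent.ConcurrentHashMap;

        // Hypothetical simplification of the caching pattern, not Hudi's actual classes.
        public class ReporterReuseSketch {

          interface MetricsReporter {
            void report();     // push buffered metrics to the sink
            void shutdown();   // closes the underlying client; not usable afterwards
          }

          // One reporter cached per table base path, kept across micro-batches.
          private static final Map<String, MetricsReporter> REPORTERS = new ConcurrentHashMap<>();

          static MetricsReporter getReporter(String basePath) {
            return REPORTERS.computeIfAbsent(basePath, p -> newReporter());
          }

          static void shutdownAll() {
            // Batch 1's cleanup path shuts every reporter down ...
            REPORTERS.values().forEach(MetricsReporter::shutdown);
            // ... but without evicting the entries (e.g. REPORTERS.clear()),
            // batch 2 gets the same shut-down instance back from getReporter()
            // and report() fails, which matches the behavior in this issue.
          }

          private static MetricsReporter newReporter() {
            // Stand-in for a CloudWatch-backed reporter.
            return new MetricsReporter() {
              private volatile boolean stopped = false;
              public void report() {
                if (stopped) {
                  throw new IllegalStateException("reporter already shut down");
                }
                // push metrics ...
              }
              public void shutdown() {
                stopped = true;
              }
            };
          }
        }

      Either evicting the cached entry on shutdown or constructing a fresh reporter for each batch would avoid handing an already shut-down instance to later batches.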

            People

              Assignee: Unassigned
              Reporter: Akira Ajisaka (aajisaka)
              Votes: 0
              Watchers: 2
