Flink / FLINK-16142

Memory Leak causes Metaspace OOM error on repeated job submission


Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: 1.10.0
    • None
    • None

    Description

      Hi Guys,

      We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our use-case exactly (RocksDB state backend running in a containerised cluster). Unfortunately, it seems like there is a memory leak somewhere in the job submission logic. We are getting this error:

      2020-02-18 10:22:10,020 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME switched from RUNNING to FAILED.
      java.lang.OutOfMemoryError: Metaspace
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
      at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
      at org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
      at org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
      at org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
      at org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.<clinit>(AwsSdkMetrics.java:359)
      at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
      at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
      at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
      at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
      at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
      at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
      at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
      at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
      at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
      at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
      at org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
      at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
      at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
      at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
      

      (The only change to the log above is OPERATOR_NAME, which replaces some internal specifics of our system.)

      This will reliably happen on a fresh cluster after submitting and cancelling our job 3 times.

      We are using the presto-s3 plugin, the CEP library and the Kinesis connector.

      Please let me know what other diagnostics would be useful.
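
      One diagnostic that may help narrow this down is tracking Metaspace usage and class load/unload counts across submit/cancel cycles. The following is a minimal sketch using only the standard java.lang.management API. The class name MetaspaceProbe is hypothetical and not part of Flink. It reports on the JVM it runs in, so it would have to be invoked from inside the TaskManager process (for example from a user function) to be meaningful, and it assumes a HotSpot JVM on Java 8 or later, where the memory pool is named "Metaspace".

      ```java
      import java.lang.management.ClassLoadingMXBean;
      import java.lang.management.ManagementFactory;
      import java.lang.management.MemoryPoolMXBean;
      import java.lang.management.MemoryUsage;

      /** Hypothetical helper (not part of Flink): reports Metaspace usage and class counts for the current JVM. */
      class MetaspaceProbe {

          /** Bytes used by the pool named "Metaspace" (HotSpot naming, an assumption), or -1 if no such pool is found. */
          static long metaspaceUsedBytes() {
              for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                  if ("Metaspace".equals(pool.getName())) {
                      MemoryUsage usage = pool.getUsage();
                      return usage == null ? -1L : usage.getUsed();
                  }
              }
              return -1L;
          }

          /** Number of classes currently loaded in this JVM. */
          static int loadedClassCount() {
              return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
          }

          /** Total number of classes unloaded since this JVM started. */
          static long unloadedClassCount() {
              ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
              return bean.getUnloadedClassCount();
          }

          public static void main(String[] args) {
              System.out.printf("metaspaceUsedBytes=%d loadedClasses=%d unloadedClasses=%d%n",
                      metaspaceUsedBytes(), loadedClassCount(), unloadedClassCount());
          }
      }
      ```

      If the Metaspace figure grows with every submit/cancel cycle while the unloaded class count stays flat, that would be consistent with the class loaders of cancelled jobs not being garbage collected.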

      Tom

      Attachments

        1. java_pid1.hprof, 47.24 MB, Thomas Wozniakowski
        2. Leak-GC-root.png, 115 kB, Stephan Ewen
        3. java_pid1.hprof, 35.02 MB, Thomas Wozniakowski


          People

            Assignee: Andrey Zagrebin (azagrebin)
            Reporter: Thomas Wozniakowski (Jamalarm)
            Votes: 0
            Watchers: 26
