Details
- Type: Bug
- Status: Closed
- Priority: Blocker
- Resolution: Duplicate
- Affects Version/s: 1.10.0
- Fix Version/s: None
- Component/s: None
Description
Hi Guys,
We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our use-case exactly (RocksDB state backend running in a containerised cluster). Unfortunately, it seems like there is a memory leak somewhere in the job submission logic. We are getting this error:
2020-02-18 10:22:10,020 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME switched from RUNNING to FAILED.
java.lang.OutOfMemoryError: Metaspace
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
    at org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
    at org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
    at org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
    at org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.<clinit>(AwsSdkMetrics.java:359)
    at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
    at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
    at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
    at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
    at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
    at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
    at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
    at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
    at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
    at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
    at org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
    at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
    at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
    at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
(The only change to the text above is OPERATOR_NAME, where I removed some internal specifics of our system.)
This will reliably happen on a fresh cluster after submitting and cancelling our job 3 times.
We are using the presto-s3 plugin, the CEP library and the Kinesis connector.
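From the stack trace, the failure occurs while the shaded AWS SDK's static initializer registers a metric admin MBean through Flink's ChildFirstClassLoader. Below is a minimal sketch (invented names, not Flink or AWS SDK code) of why such a registration can pin a per-job classloader and grow Metaspace on every submission:

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Illustration only: registering an MBean whose class was loaded by a
// per-job classloader keeps that classloader strongly reachable.
public class MBeanLeakSketch {

    // Standard MBean pattern (interface "<Name>MBean" plus implementing class),
    // standing in for the SDK's MetricAdmin MBean seen in the stack trace.
    public interface DummyMetricAdminMBean {
        boolean isEnabled();
    }

    public static class DummyMetricAdmin implements DummyMetricAdminMBean {
        @Override
        public boolean isEnabled() {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        // The platform MBeanServer lives for the whole JVM (i.e. the TaskManager process).
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();

        // Imagine DummyMetricAdmin was loaded by the job's ChildFirstClassLoader.
        // Registering it creates a strong reference chain that outlives the job:
        //   MBeanServer -> MBean instance -> its Class -> ChildFirstClassLoader
        // While the MBean stays registered, that classloader and every class it
        // loaded cannot be unloaded, so Metaspace grows with each submit/cancel cycle.
        server.registerMBean(new DummyMetricAdmin(),
                new ObjectName("com.example:type=DummyMetricAdmin"));
    }
}

If that mechanism applies here, each submit/cancel cycle would leave one more user classloader reachable from the MBeanServer, which would be consistent with the cluster running out of Metaspace after the third submission.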
Please let me know what other diagnostics would be useful.
Tom
Attachments
Issue Links
- is related to
  - FLINK-11205 Task Manager Metaspace Memory Leak (Closed)
  - FLINK-16246 Exclude "SdkMBeanRegistrySupport" from dynamically loaded AWS connectors (Closed)
  - FLINK-16406 Increase default value for JVM Metaspace to minimise its OutOfMemoryError (Closed)
- is superseded by
  - FLINK-16225 Metaspace Out Of Memory should be handled as Fatal Error in TaskManager (Closed)