Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Cannot Reproduce
-
1.5.4
-
None
-
OS: Debuan rodente 4.17
Flink version: 1.5.4
Key Value jobmanager.heap.mb 1024 jobmanager.rpc.address localhost jobmanager.rpc.port 6123 metrics.reporter.jmx.class org.apache.flink.metrics.jmx.JMXReporter metrics.reporter.jmx.port 9250-9260 metrics.reporters jmx parallelism.default 1 rest.port 8081 taskmanager.heap.mb 1024 taskmanager.numberOfTaskSlots 1 web.tmpdir /tmp/flink-web-bdb73d6c-5b9e-47b5-9ebf-eed0a7c82c26 Overview
Data Port All Slots Free Slots CPU Cores Physical Memory JVM Heap Size Flink Managed Memory 43501 1 0 12 62.9 GB 922 MB 642 MB Memory
JVM (Heap/Non-Heap)
Type Committed Used Maximum Heap 922 MB 575 MB 922 MB Non-Heap 68.8 MB 64.3 MB -1 B Total 991 MB 639 MB 922 MB Outside JVM
Type Count Used Capacity Direct 3,292 105 MB 105 MB Mapped 0 0 B 0 B Network
Memory Segments
Type Count Available 3,194 Total 3,278 Garbage Collection
Collector Count Time G1_Young_Generation 13 336 G1_Old_Generation 1 21 OS: Debuan rodente 4.17 Flink version: 1.5.4 Key Value jobmanager.heap.mb 1024 jobmanager.rpc.address localhost jobmanager.rpc.port 6123 metrics.reporter.jmx.class org.apache.flink.metrics.jmx.JMXReporter metrics.reporter.jmx.port 9250-9260 metrics.reporters jmx parallelism.default 1 rest.port 8081 taskmanager.heap.mb 1024 taskmanager.numberOfTaskSlots 1 web.tmpdir /tmp/flink-web-bdb73d6c-5b9e-47b5-9ebf-eed0a7c82c26 Overview Data Port All Slots Free Slots CPU Cores Physical Memory JVM Heap Size Flink Managed Memory 43501 1 0 12 62.9 GB 922 MB 642 MB Memory JVM (Heap/Non-Heap) Type Committed Used Maximum Heap 922 MB 575 MB 922 MB Non-Heap 68.8 MB 64.3 MB -1 B Total 991 MB 639 MB 922 MB Outside JVM Type Count Used Capacity Direct 3,292 105 MB 105 MB Mapped 0 0 B 0 B Network Memory Segments Type Count Available 3,194 Total 3,278 Garbage Collection Collector Count Time G1_Young_Generation 13 336 G1_Old_Generation 1 21
Description
I am running a fairly complex pipleline with 200+ task.
The pipeline works fine with small data (order of 10kb input) but gets stuck with a slightly larger data (300kb input).
The task gets stuck while writing the output toFlink, more specifically it gets stuck while requesting memory segment in local buffer pool. The Task manager UI shows that it has enough memory and memory segments to work with.
The relevant stack trace is
"grpc-default-executor-0" #138 daemon prio=5 os_prio=0 tid=0x00007fedb0163800 nid=0x30b7f in Object.wait() [0x00007fedb4f90000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at (C/C++) 0x00007fef201c7dae (Unknown Source)
at (C/C++) 0x00007fef1f2aea07 (Unknown Source)
at (C/C++) 0x00007fef1f241cd3 (Unknown Source)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000f6d56450> (a java.util.ArrayDeque)
at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegment(LocalBufferPool.java:247)- locked <0x00000000f6d56450> (a java.util.ArrayDeque)
at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:204)
at org.apache.flink.runtime.io.network.api.writer.RecordWriter.requestNewBufferBuilder(RecordWriter.java:213)
at org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:144)
at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:107)
at org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:65)
at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
at org.apache.beam.runners.flink.translation.functions.FlinkExecutableStagePruningFunction.flatMap(FlinkExecutableStagePruningFunction.java:42)
at org.apache.beam.runners.flink.translation.functions.FlinkExecutableStagePruningFunction.flatMap(FlinkExecutableStagePruningFunction.java:26)
at org.apache.flink.runtime.operators.chaining.ChainedFlatMapDriver.collect(ChainedFlatMapDriver.java:80)
at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
at org.apache.beam.runners.flink.translation.functions.FlinkExecutableStageFunction$MyDataReceiver.accept(FlinkExecutableStageFunction.java:230)- locked <0x00000000f6a60bd0> (a java.lang.Object)
at org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.accept(BeamFnDataInboundObserver.java:81)
at org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.accept(BeamFnDataInboundObserver.java:32)
at org.apache.beam.sdk.fn.data.BeamFnDataGrpcMultiplexer$InboundObserver.onNext(BeamFnDataGrpcMultiplexer.java:139)
at org.apache.beam.sdk.fn.data.BeamFnDataGrpcMultiplexer$InboundObserver.onNext(BeamFnDataGrpcMultiplexer.java:125)
at org.apache.beam.vendor.grpc.v1.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:248)
at org.apache.beam.vendor.grpc.v1.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
at org.apache.beam.vendor.grpc.v1.io.grpc.Contexts$ContextualizedServerCallListener.onMessage(Contexts.java:76)
at org.apache.beam.vendor.grpc.v1.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:263)
at org.apache.beam.vendor.grpc.v1.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:683)
at org.apache.beam.vendor.grpc.v1.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at org.apache.beam.vendor.grpc.v1.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The full stack trace and logs are attached.
Please take a look and let me know if more information is needed.