FLINK-20547

Batch job fails due to the exception in network stack


    Details

      Description

      I ran a simple batch job with only two job vertices: a source and a sink.

      Both vertices have a parallelism of 8000, and they are connected via all-to-all blocking edges.
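
To give a sense of the scale of this topology, here is a rough estimate of the input-side buffer demand. It is only a sketch under stated assumptions: 2 exclusive buffers per channel and 32 KiB memory segments (Flink's documented defaults), and every input channel being remote. The actual settings of the failing job are not stated here, and the class name `ChannelBufferEstimate` is made up for illustration.

```java
// Back-of-the-envelope estimate for an all-to-all edge between two vertices.
// ASSUMPTIONS (not taken from the report): 2 exclusive buffers per channel,
// 32 KiB segments, and all input channels are remote.
final class ChannelBufferEstimate {

    /** Number of logical channels in an all-to-all connection. */
    static long totalChannels(int producers, int consumers) {
        return (long) producers * consumers;
    }

    /** Exclusive buffers one consumer subtask would request for its input gate. */
    static long exclusiveBuffersPerGate(int producers, int buffersPerChannel) {
        return (long) producers * buffersPerChannel;
    }

    /** Memory behind those exclusive buffers, in bytes. */
    static long bytesPerGate(int producers, int buffersPerChannel, int segmentSizeBytes) {
        return exclusiveBuffersPerGate(producers, buffersPerChannel) * segmentSizeBytes;
    }

    public static void main(String[] args) {
        int parallelism = 8000;          // from the report
        int buffersPerChannel = 2;       // assumed default
        int segmentSizeBytes = 32 * 1024; // assumed default

        System.out.println("channels in total:        "
                + totalChannels(parallelism, parallelism));
        System.out.println("exclusive buffers / sink: "
                + exclusiveBuffersPerGate(parallelism, buffersPerChannel));
        System.out.println("MiB / sink:               "
                + bytesPerGate(parallelism, buffersPerChannel, segmentSizeBytes) / (1024 * 1024));
    }
}
```

Under these assumptions each sink subtask has 8000 input channels and would ask for up to 16,000 exclusive buffers, roughly 500 MiB. This does not explain the failure by itself; it only shows how much buffer traffic the network buffer pool has to handle when a sink sets up its channels.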

      While the sink tasks were running, the following exception was thrown:

      2020-12-09 18:43:48,981 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Sink: Sink 1 (1595/8000) (08bd4214d6e0dc144e9654f1faaa3b28) switched from RUNNING to FAILED on [masked container name] @ [masked address] (dataPort=47872).
      java.io.IOException: java.lang.IllegalStateException: Inconsistent availability: expected true
      	at org.apache.flink.runtime.io.network.partition.consumer.InputChannel.checkError(InputChannel.java:232) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.partition.consumer.RecoveredInputChannel.getNextBuffer(RecoveredInputChannel.java:165) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.waitAndGetNextData(SingleInputGate.java:626) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:603) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.pollNext(SingleInputGate.java:591) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.pollNext(InputGateWithMetrics.java:109) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.streaming.runtime.io.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:142) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:157) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:67) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:372) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:575) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:539) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:722) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:547) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at java.lang.Thread.run(Thread.java:834) ~[?:1.8.0_102]
      Caused by: java.lang.IllegalStateException: Inconsistent availability: expected true
      	at org.apache.flink.util.Preconditions.checkState(Preconditions.java:198) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.checkConsistentAvailability(LocalBufferPool.java:434) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.setNumBuffers(LocalBufferPool.java:564) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.redistributeBuffers(NetworkBufferPool.java:509) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.tryRedistributeBuffers(NetworkBufferPool.java:438) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestMemorySegments(NetworkBufferPool.java:166) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestMemorySegments(NetworkBufferPool.java:60) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.partition.consumer.BufferManager.requestExclusiveBuffers(BufferManager.java:131) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.setup(RemoteInputChannel.java:148) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.partition.consumer.RemoteRecoveredInputChannel.toInputChannelInternal(RemoteRecoveredInputChannel.java:76) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.partition.consumer.RecoveredInputChannel.toInputChannel(RecoveredInputChannel.java:91) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.convertRecoveredInputChannels(SingleInputGate.java:299) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:285) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.requestPartitions(InputGateWithMetrics.java:94) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:283) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:184) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	... 5 more
      

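      The exception appears to originate in the network stack: the root IllegalStateException is thrown by LocalBufferPool.checkConsistentAvailability while NetworkBufferPool redistributes buffers during RemoteInputChannel.setup.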
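
For readers unfamiliar with this kind of check, the following toy model shows how a consistency assertion can produce a message like "Inconsistent availability: expected true". It is purely illustrative: `ToyBufferPool` and its methods are hypothetical, it is not Flink's `LocalBufferPool`, and it makes no claim about the actual cause of this failure. The assumption is simply that a pool keeps a cached availability flag next to its real state and verifies that the two agree.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only, NOT Flink code. A cached "available" flag is kept next to the
// real free list; the check throws if the two views disagree.
final class ToyBufferPool {
    private final Deque<byte[]> free = new ArrayDeque<>();
    private boolean availableFlag; // cached view of "free is not empty"

    /** Returns a segment to the pool; updateFlag=false simulates a missed flag update. */
    void recycle(byte[] segment, boolean updateFlag) {
        free.add(segment);
        if (updateFlag) {
            availableFlag = true;
        }
    }

    /** Takes a segment (or null) and clears the flag once the pool is empty. */
    byte[] request() {
        byte[] segment = free.poll();
        if (free.isEmpty()) {
            availableFlag = false;
        }
        return segment;
    }

    /** Compares the cached flag with the state recomputed from the free list. */
    void checkConsistentAvailability() {
        boolean expected = !free.isEmpty();
        if (availableFlag != expected) {
            throw new IllegalStateException("Inconsistent availability: expected " + expected);
        }
    }

    public static void main(String[] args) {
        ToyBufferPool pool = new ToyBufferPool();
        pool.recycle(new byte[32], false); // free list has a segment, flag still says "unavailable"
        try {
            pool.checkConsistentAvailability();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this toy, the check fails whenever some code path changes the free list without updating the flag, or the other way round. Whether anything comparable happens in the real buffer pool at this scale would have to be confirmed from the attached logs and the Flink source.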

      The full log of the job is attached below.

       

        Attachments

        1. inconsistent.tar.gz (4.65 MB, Zhilong Hong)

              People

              • Assignee: Yingjie Cao (kevin.cyj)
              • Reporter: Zhilong Hong (Thesharing)
