Flink / FLINK-20547

Batch job fails due to an exception in the network stack

Details

    Description

      I ran a simple batch job with only two job vertices: a source and a sink.

      Both vertices have a parallelism of 8000, and they are connected via all-to-all blocking edges.
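
For a sense of scale, the topology described above can be put into numbers. The following is only a back-of-the-envelope sketch: the class and method names are hypothetical, and the values of 2 exclusive buffers per channel and 32 KiB per memory segment are assumptions (they are not stated in this report, and the job's actual configuration may differ). It does not explain the failure; it only shows how many input channels and exclusive buffers a job of this shape may involve.

```java
public class AllToAllScale {

    // With an all-to-all edge, every consumer subtask reads from every producer
    // subtask, so the job-wide channel count is producers * consumers.
    static long totalChannels(int producerParallelism, int consumerParallelism) {
        return (long) producerParallelism * consumerParallelism;
    }

    // Bytes of exclusive buffers one consumer subtask would request if every
    // input channel is given `buffersPerChannel` segments of `segmentBytes` each.
    // Both parameters are assumptions supplied by the caller.
    static long exclusiveBytesPerConsumer(int producerParallelism,
                                          int buffersPerChannel,
                                          int segmentBytes) {
        return (long) producerParallelism * buffersPerChannel * segmentBytes;
    }

    public static void main(String[] args) {
        int parallelism = 8000;            // from the report
        int buffersPerChannel = 2;         // assumed value
        int segmentBytes = 32 * 1024;      // assumed value (32 KiB)

        System.out.println("channels in the job: "
                + totalChannels(parallelism, parallelism));
        System.out.println("exclusive bytes per sink subtask: "
                + exclusiveBytesPerConsumer(parallelism, buffersPerChannel, segmentBytes));
    }
}
```

Under these assumed values, each sink subtask has 8000 input channels and would ask for roughly 500 MiB of exclusive buffers if all channels were remote.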

      While the sink tasks were running, the following exception was thrown:

      2020-12-09 18:43:48,981 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Sink: Sink 1 (1595/8000) (08bd4214d6e0dc144e9654f1faaa3b28) switched from RUNNING to FAILED on [masked container name] @ [masked address] (dataPort=47872).
      java.io.IOException: java.lang.IllegalStateException: Inconsistent availability: expected true
      	at org.apache.flink.runtime.io.network.partition.consumer.InputChannel.checkError(InputChannel.java:232) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.partition.consumer.RecoveredInputChannel.getNextBuffer(RecoveredInputChannel.java:165) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.waitAndGetNextData(SingleInputGate.java:626) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:603) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.pollNext(SingleInputGate.java:591) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.pollNext(InputGateWithMetrics.java:109) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.streaming.runtime.io.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:142) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:157) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:67) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:372) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:575) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:539) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:722) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:547) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at java.lang.Thread.run(Thread.java:834) ~[?:1.8.0_102]
      Caused by: java.lang.IllegalStateException: Inconsistent availability: expected true
      	at org.apache.flink.util.Preconditions.checkState(Preconditions.java:198) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.checkConsistentAvailability(LocalBufferPool.java:434) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.setNumBuffers(LocalBufferPool.java:564) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.redistributeBuffers(NetworkBufferPool.java:509) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.tryRedistributeBuffers(NetworkBufferPool.java:438) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestMemorySegments(NetworkBufferPool.java:166) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestMemorySegments(NetworkBufferPool.java:60) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.partition.consumer.BufferManager.requestExclusiveBuffers(BufferManager.java:131) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.setup(RemoteInputChannel.java:148) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.partition.consumer.RemoteRecoveredInputChannel.toInputChannelInternal(RemoteRecoveredInputChannel.java:76) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.partition.consumer.RecoveredInputChannel.toInputChannel(RecoveredInputChannel.java:91) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.convertRecoveredInputChannels(SingleInputGate.java:299) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:285) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.requestPartitions(InputGateWithMetrics.java:94) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:283) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:184) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
      	... 5 more
      

      It seems to be an exception in the network stack.
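
A note on reading the trace: the outer IOException is raised from getNextBuffer via checkError, while the root cause was thrown earlier on the same thread, during channel setup inside requestPartitions. One pattern that would produce this shape is a channel that records a setup failure and rethrows it, wrapped, on the next read. The toy model below sketches that pattern only; ToyChannel and its methods are hypothetical stand-ins inferred from the frame names in the trace, not Flink's actual classes.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;

public class DeferredChannelError {

    // Hypothetical stand-in for an input channel; not Flink code.
    static final class ToyChannel {
        // First failure seen by this channel, if any.
        private final AtomicReference<Throwable> cause = new AtomicReference<>();

        // Setup step: a failed consistency check is recorded instead of propagated.
        void setup(boolean poolIsConsistent) {
            try {
                if (!poolIsConsistent) {
                    throw new IllegalStateException("Inconsistent availability: expected true");
                }
            } catch (Throwable t) {
                cause.compareAndSet(null, t); // keep only the first error
            }
        }

        // Read step: surfaces any recorded failure before returning data.
        String getNextBuffer() throws IOException {
            checkError();
            return "buffer";
        }

        // Rethrows the recorded failure, wrapping non-IOExceptions in an IOException.
        private void checkError() throws IOException {
            Throwable t = cause.get();
            if (t != null) {
                throw t instanceof IOException ? (IOException) t : new IOException(t);
            }
        }
    }

    // Runs the failing scenario and returns the message of the surfaced exception.
    static String describeFailure() {
        ToyChannel channel = new ToyChannel();
        channel.setup(false);
        try {
            channel.getNextBuffer();
            return "no failure";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("java.io.IOException: " + describeFailure());
    }
}
```

In this sketch the wrapped exception's message is the string form of the cause, which is the same shape as the first line of the exception in the log above. The practical point is that the "Caused by" section is the place to start when diagnosing the failure.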

      The full log of the job is attached below.

       

      Attachments

        1. inconsistent.tar.gz
          4.65 MB
          Zhilong Hong

            People

              kevin.cyj Yingjie Cao
              Thesharing Zhilong Hong
              Votes: 0
              Watchers: 6
