Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-5553

Job fails during deployment with IllegalStateException from subpartition request

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.3.0
    • Component/s: Network
    • Labels:
      None

      Description

      While running a test job with Flink 1.3-SNAPSHOT (6fb6967b9f9a31f034bd09fcf76aaf147bc8e9a0) the job failed with this exception:

      2017-01-18 14:56:27,043 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Sink: Unnamed (9/10) (befc06d0e792c2ce39dde74b365dd3cf) switched from DEPLOYING to RUNNING.
      2017-01-18 14:56:27,059 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map (9/10) (e94a01ec283e5dce7f79b02cf51654c4) switched from DEPLOYING to RUNNING.
      2017-01-18 14:56:27,817 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map (10/10) (cbb61c9a2f72c282877eb383e111f7cd) switched from RUNNING to FAILED.
      java.lang.IllegalStateException: There has been an error in the channel.
              at org.apache.flink.util.Preconditions.checkState(Preconditions.java:195)
              at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.addInputChannel(PartitionRequestClientHandler.java:77)
              at org.apache.flink.runtime.io.network.netty.PartitionRequestClient.requestSubpartition(PartitionRequestClient.java:104)
              at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:115)
              at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:419)
              at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:441)
              at org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:153)
              at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:192)
              at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:63)
              at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:270)
              at org.apache.flink.runtime.taskmanager.Task.run(Task.java:666)
              at java.lang.Thread.run(Thread.java:745)
      2017-01-18 14:56:27,819 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job Misbehaved Job (b1d985d11984df57400fdff2bb656c59) switched from state RUNNING to FAILING.
      java.lang.IllegalStateException: There has been an error in the channel.
              at org.apache.flink.util.Preconditions.checkState(Preconditions.java:195)
              at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.addInputChannel(PartitionRequestClientHandler.java:77)
              at org.apache.flink.runtime.io.network.netty.PartitionRequestClient.requestSubpartition(PartitionRequestClient.java:104)
              at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:115)
              at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:419)
              at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:441)
              at org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:153)
              at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:192)
              at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:63)
              at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:270)
              at org.apache.flink.runtime.taskmanager.Task.run(Task.java:666)
              at java.lang.Thread.run(Thread.java:745)
      

      This is the first exception that is reported to the jobmanager.

      I think this is related to missing network buffers. You see that from the next deployment after the restart, where the deployment fails with the insufficient number of buffers exception.
      I'll add logs to the JIRA.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                NicoK Nico Kruber
                Reporter:
                rmetzger Robert Metzger
              • Votes:
                0 Vote for this issue
                Watchers:
                5 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: