Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-5553

Job fails during deployment with IllegalStateException from subpartition request

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.3.0
    • Component/s: Network
    • Labels:
      None

      Description

      While running a test job with Flink 1.3-SNAPSHOT (6fb6967b9f9a31f034bd09fcf76aaf147bc8e9a0) the job failed with this exception:

      2017-01-18 14:56:27,043 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Sink: Unnamed (9/10) (befc06d0e792c2ce39dde74b365dd3cf) switched from DEPLOYING to RUNNING.
      2017-01-18 14:56:27,059 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map (9/10) (e94a01ec283e5dce7f79b02cf51654c4) switched from DEPLOYING to RUNNING.
      2017-01-18 14:56:27,817 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map (10/10) (cbb61c9a2f72c282877eb383e111f7cd) switched from RUNNING to FAILED.
      java.lang.IllegalStateException: There has been an error in the channel.
              at org.apache.flink.util.Preconditions.checkState(Preconditions.java:195)
              at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.addInputChannel(PartitionRequestClientHandler.java:77)
              at org.apache.flink.runtime.io.network.netty.PartitionRequestClient.requestSubpartition(PartitionRequestClient.java:104)
              at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:115)
              at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:419)
              at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:441)
              at org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:153)
              at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:192)
              at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:63)
              at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:270)
              at org.apache.flink.runtime.taskmanager.Task.run(Task.java:666)
              at java.lang.Thread.run(Thread.java:745)
      2017-01-18 14:56:27,819 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job Misbehaved Job (b1d985d11984df57400fdff2bb656c59) switched from state RUNNING to FAILING.
      java.lang.IllegalStateException: There has been an error in the channel.
              at org.apache.flink.util.Preconditions.checkState(Preconditions.java:195)
              at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.addInputChannel(PartitionRequestClientHandler.java:77)
              at org.apache.flink.runtime.io.network.netty.PartitionRequestClient.requestSubpartition(PartitionRequestClient.java:104)
              at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:115)
              at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:419)
              at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:441)
              at org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:153)
              at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:192)
              at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:63)
              at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:270)
              at org.apache.flink.runtime.taskmanager.Task.run(Task.java:666)
              at java.lang.Thread.run(Thread.java:745)
      

      This is the first exception that is reported to the jobmanager.

      I think this is related to missing network buffers. You see that from the next deployment after the restart, where the deployment fails with the insufficient number of buffers exception.
      I'll add logs to the JIRA.

        Issue Links

          Activity

          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user NicoK opened a pull request:

          https://github.com/apache/flink/pull/3299

          FLINK-5553 keep the original throwable in PartitionRequestClientHandler

          This way, when checking for a previous error in any input channel, we can throw a meaningful exception instead of the inspecific `IllegalStateException("There has been an error in the channel.")` before.

          Note that the original `Throwable` (from an existing channel) may or may not have been printed by the `InputGate` yet. Any new input channel, however, did not get the `Throwable` and must fail through the (now enhanced) fallback mechanism.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/NicoK/flink flink-5553

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3299.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3299


          commit a722d7cd4c4218543c87c2e8a3b3bbc708bddf55
          Author: Nico Kruber <nico@data-artisans.com>
          Date: 2017-02-13T15:30:59Z

          FLINK-5553 keep the original throwable in PartitionRequestClientHandler

          This way, when checking for a previous error in any input channel, we can throw
          a meaningful exception instead of the inspecific
          IllegalStateException("There has been an error in the channel.") before.

          Note that the original throwable (from an existing channel) may or may not
          have been printed by the InputGate yet. Any new input channel, however, did not
          get the Throwable and must fail through the (now enhanced) fallback mechanism.


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user NicoK opened a pull request: https://github.com/apache/flink/pull/3299 FLINK-5553 keep the original throwable in PartitionRequestClientHandler This way, when checking for a previous error in any input channel, we can throw a meaningful exception instead of the inspecific `IllegalStateException("There has been an error in the channel.")` before. Note that the original `Throwable` (from an existing channel) may or may not have been printed by the `InputGate` yet. Any new input channel, however, did not get the `Throwable` and must fail through the (now enhanced) fallback mechanism. You can merge this pull request into a Git repository by running: $ git pull https://github.com/NicoK/flink flink-5553 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3299.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3299 commit a722d7cd4c4218543c87c2e8a3b3bbc708bddf55 Author: Nico Kruber <nico@data-artisans.com> Date: 2017-02-13T15:30:59Z FLINK-5553 keep the original throwable in PartitionRequestClientHandler This way, when checking for a previous error in any input channel, we can throw a meaningful exception instead of the inspecific IllegalStateException("There has been an error in the channel.") before. Note that the original throwable (from an existing channel) may or may not have been printed by the InputGate yet. Any new input channel, however, did not get the Throwable and must fail through the (now enhanced) fallback mechanism.
          Hide
          NicoK Nico Kruber added a comment -

          If you try again with the PR above, there should now be a more meaningful exception that should point you to the original exception.

          From the log, I'd also guess that this is due insufficient network buffers though.

          Show
          NicoK Nico Kruber added a comment - If you try again with the PR above, there should now be a more meaningful exception that should point you to the original exception. From the log, I'd also guess that this is due insufficient network buffers though.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3299

          Good fix, +1

          Merging this...

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3299 Good fix, +1 Merging this...
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/3299

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/3299
          Hide
          StephanEwen Stephan Ewen added a comment -

          Fixed via af81bebd0fabc6390930689df131e72edab6995b

          Show
          StephanEwen Stephan Ewen added a comment - Fixed via af81bebd0fabc6390930689df131e72edab6995b

            People

            • Assignee:
              NicoK Nico Kruber
              Reporter:
              rmetzger Robert Metzger
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development