Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-24667

Channel state writer would fail the task directly if meeting exception previously

    XMLWordPrintableJSON

Details

    Description

      Currently, if channel state writer come across exception when closing a file, such as meet exception during SubtaskCheckpointCoordinatorImpl#cancelAsyncCheckpointRunnable, it will exit the loop. However, in the following channelStateWriter#abort it will throw exception directly:

      switched from RUNNING to FAILED with failure cause: java.io.IOException: java.lang.RuntimeException: unable to send request to worker
      	at org.apache.flink.runtime.io.network.partition.consumer.InputChannel.checkError(InputChannel.java:228)
      	at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.checkPartitionRequestQueueInitialized(RemoteInputChannel.java:735)
      	at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.getNextBuffer(RemoteInputChannel.java:204)
      	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.waitAndGetNextData(SingleInputGate.java:651)
      	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:626)
      	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.pollNext(SingleInputGate.java:612)
      	at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.pollNext(InputGateWithMetrics.java:109)
      	at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:149)
      	at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:110)
      	at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:66)
      	at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessor.processInput(StreamTwoInputProcessor.java:98)
      	at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:424)
      	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:204)
      	at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:685)
      	at org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:640)
      	at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:651)
      	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:624)
      	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:798)
      	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:585)
      

      This is not expected as checkpoint failure should not lead to task failover each time.

      Attachments

        Issue Links

          Activity

            People

              roman Roman Khachatryan
              yunta Yun Tang
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: