Flink / FLINK-9413

Tasks can fail with PartitionNotFoundException if consumer deployment takes too long

Details

    Description

      Tasks can fail with a PartitionNotFoundException if the deployment of the producer takes too long. More specifically, if deployment takes longer than taskmanager.network.request-backoff.max allows, the consuming task gives up and fails.
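      To make the time window concrete, here is a minimal sketch of a capped backoff loop. It is an illustration, not Flink's actual implementation: the class and method names are hypothetical, the doubling rule is an assumption about how the backoff grows, and 100 ms / 10000 ms are example values for the initial and maximum backoff.

```java
// Illustrative sketch only; names are hypothetical and this is not Flink's code.
final class RequestBackoffSketch {

    /**
     * Returns the next backoff in milliseconds, or -1 once the maximum has
     * already been used and the caller should give up.
     * Assumption: the backoff doubles on each retry and is capped at maxMs.
     */
    static int nextBackoff(int currentMs, int initialMs, int maxMs) {
        if (initialMs <= 0 || maxMs < initialMs) {
            throw new IllegalArgumentException("need 0 < initialMs <= maxMs");
        }
        if (currentMs == 0) {
            return initialMs;                      // first retry
        }
        if (currentMs < maxMs) {
            return Math.min(currentMs * 2, maxMs); // keep backing off, capped
        }
        return -1;                                 // cap already reached: give up
    }

    /** Sums all backoff intervals a consumer would wait before giving up. */
    static long totalWaitMs(int initialMs, int maxMs) {
        long total = 0;
        int current = 0;
        while ((current = nextBackoff(current, initialMs, maxMs)) != -1) {
            total += current;
        }
        return total;
    }

    public static void main(String[] args) {
        // 100 + 200 + 400 + 800 + 1600 + 3200 + 6400 + 10000
        System.out.println(totalWaitMs(100, 10_000)); // prints 22700
    }
}
```

      Under these assumptions, a producer that needs more than roughly 23 seconds to start would leave the consumer with no retries left, which is the failure described above.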

      The problem is that we calculate the InputGateDeploymentDescriptor for a consuming task as soon as the producer has been assigned a slot, without waiting until the producer is actually running. Waiting until the producer task is in state RUNNING before assigning the result partition to the consumer should fix the problem.
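      The proposed fix can be sketched as a guard on the producer's state. This is a simplified illustration under stated assumptions, not Flink's scheduler code: the ProducerState enum is a reduced, hypothetical stand-in for the real execution states, and partitionLocationFor and the address string are invented for the example.

```java
import java.util.Optional;

// Illustrative sketch only; all names here are hypothetical.
final class PartitionAssignmentSketch {

    /** Reduced stand-in for a task's lifecycle states. */
    enum ProducerState { CREATED, SCHEDULED, DEPLOYING, RUNNING }

    /**
     * Hands the producer's location to the consumer only when the producer
     * is RUNNING. Before that, the partition is treated as not yet known,
     * so the consumer is never pointed at a partition that does not exist.
     */
    static Optional<String> partitionLocationFor(ProducerState state, String producerAddress) {
        return state == ProducerState.RUNNING
                ? Optional.of(producerAddress)
                : Optional.empty();
    }

    public static void main(String[] args) {
        // Slot assigned but task not running yet: no location is handed out.
        System.out.println(partitionLocationFor(ProducerState.DEPLOYING, "tm-1:6121").isPresent());
        // Producer running: the consumer may now request the partition.
        System.out.println(partitionLocationFor(ProducerState.RUNNING, "tm-1:6121").isPresent());
    }
}
```

      The design choice is to move the wait from the consumer (bounded retries against a possibly missing partition) to the point where the descriptor is created, so slow producer deployment delays the consumer instead of failing it.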

    People

      Assignee: Unassigned
      Reporter: Till Rohrmann (trohrmann)
      Votes: 0
      Watchers: 7

    Time Tracking

      Original Estimate: Not Specified
      Remaining Estimate: 0h
      Time Spent: 10m