Flink / FLINK-18150

A single failing Kafka broker may cause jobs to fail indefinitely with TimeoutException: Timeout expired while fetching topic metadata

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Bug
    • Affects Version/s: 1.10.1
    • Fix Version/s: None
    • Component/s: Connectors / Kafka
    • Labels: None

    Description

      When a Kafka broker that is listed among the bootstrap servers fails while partition discovery is active, the Flink job reading from that Kafka cluster may enter a failure loop.
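
      For orientation, the two settings involved can be sketched as plain consumer properties. This is a minimal illustration, not the reporter's actual configuration: the broker addresses, the group id, and the class and method names are hypothetical. `bootstrap.servers` and `group.id` are standard Kafka consumer keys, and `flink.partition-discovery.interval-millis` is the key the Flink Kafka consumer reads to enable partition discovery.

```java
import java.util.Properties;

final class KafkaSourceConfigSketch {
    // Key read by the Flink Kafka consumer; partition discovery stays disabled when it is absent.
    static final String PARTITION_DISCOVERY_KEY = "flink.partition-discovery.interval-millis";

    // Builds consumer properties. A non-positive interval leaves partition discovery off
    // (a choice made for this sketch; it simply omits the key).
    static Properties build(String bootstrapServers, String groupId, long discoveryIntervalMillis) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        if (discoveryIntervalMillis > 0) {
            props.setProperty(PARTITION_DISCOVERY_KEY, Long.toString(discoveryIntervalMillis));
        }
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical three-broker bootstrap list with discovery every 10 seconds.
        Properties props = build("broker-1:9092,broker-2:9092,broker-3:9092", "example-group", 10_000L);
        props.forEach((key, value) -> System.out.println(key + " = " + value));
        // In a Flink 1.10 job, these properties would be handed to the FlinkKafkaConsumer constructor.
    }
}
```

      With a configuration like this, any of the listed brokers may be contacted when the consumer needs topic metadata, which is the situation the report describes.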

      At first, the job appears to react normally: it does not fail and shows only a short latency spike while Kafka switches leaders.
      Then it fails with a

      org.apache.flink.streaming.connectors.kafka.internal.Handover$ClosedException
              at org.apache.flink.streaming.connectors.kafka.internal.Handover.close(Handover.java:182)
              at org.apache.flink.streaming.connectors.kafka.internal.KafkaFetcher.cancel(KafkaFetcher.java:175)
              at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.cancel(FlinkKafkaConsumerBase.java:821)
              at org.apache.flink.streaming.api.operators.StreamSource.cancel(StreamSource.java:147)
              at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.cancelTask(SourceStreamTask.java:136)
              at org.apache.flink.streaming.runtime.tasks.StreamTask.cancel(StreamTask.java:602)
              at org.apache.flink.runtime.taskmanager.Task$TaskCanceler.run(Task.java:1355)
              at java.lang.Thread.run(Thread.java:748)
      

      It recovers, but processes fewer records than expected.

      Finally, the job fails with

      2020-06-05 13:59:37
      org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata
      

      and keeps failing this way without processing any records. (The exception comes without a stack trace or any other useful information.)
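
      When this metadata timeout shows up, one quick check is whether every address in the bootstrap list still accepts TCP connections. The following is a hypothetical diagnostic sketch, not part of Flink or the Kafka client; the class and method names are made up for illustration. A successful connect only shows that the port is open, not that the broker is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.LinkedHashMap;
import java.util.Map;

final class BootstrapProbeSketch {
    // Returns true if a TCP connection to host:port succeeds within the timeout.
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts, and unresolvable host names.
            return false;
        }
    }

    // Probes each "host:port" entry of a comma-separated bootstrap list, in order.
    static Map<String, Boolean> probe(String bootstrapServers, int timeoutMillis) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String entry : bootstrapServers.split(",")) {
            String hostPort = entry.trim();
            if (hostPort.isEmpty()) {
                continue;
            }
            int colon = hostPort.lastIndexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Expected host:port but got: " + hostPort);
            }
            String host = hostPort.substring(0, colon);
            int port = Integer.parseInt(hostPort.substring(colon + 1));
            result.put(hostPort, isReachable(host, port, timeoutMillis));
        }
        return result;
    }
}
```

      Calling `BootstrapProbeSketch.probe("broker-1:9092,broker-2:9092", 2000)` (hypothetical addresses) returns a map from each entry to whether it was reachable, which makes a single dead bootstrap broker easy to spot.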

      I have also observed this behavior with partition-discovery turned off, but only when the Flink job failed (after a broker failure) and had to run checkpoint recovery for some other reason.

      Please see the [Environment] description for information on how to reproduce the issue.

          People

            Assignee: Aljoscha Krettek (aljoscha)
            Reporter: Julius Michaelis (Caesar)