Kafka / KAFKA-15238

Connect workers can be disabled by DLQ-related blocking admin client calls

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6.0, 3.5.2
    • Component/s: connect
    • Labels: None

    Description

      When Kafka Connect runs in distributed mode and a sink connector's task is restarted (via a worker's REST API), the following sequence of steps occurs on the DistributedHerder's thread:

      1. The existing sink task will be stopped (ref)
      2. A new sink task will be started (ref)
      3. As a part of the above step, a new WorkerSinkTask will be instantiated (ref)
      4. The DLQ reporter (see KIP-298) for the sink task is also instantiated and configured as a part of this (ref)
      5. The DLQ reporter setup involves two synchronous admin client calls: one to list topics and one to create the DLQ topic if it does not already exist (ref)
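
To make step 5 concrete, here is a minimal sketch of the "list topics, then create if missing" pattern, with each call blocking the caller until it completes. The TopicAdmin interface, the in-memory implementation, and the ensureDlqTopic name are hypothetical stand-ins so that the example runs without a broker; they are not Connect's actual code. In the real Kafka client the analogous operations are Admin.listTopics() and Admin.createTopics().

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

class DlqSetupSketch {

    /** Hypothetical stand-in for an admin client; not Kafka's real API. */
    interface TopicAdmin {
        CompletableFuture<Set<String>> listTopics();
        CompletableFuture<Void> createTopic(String name);
    }

    /** In-memory implementation so the sketch runs without a broker. */
    static class InMemoryTopicAdmin implements TopicAdmin {
        private final Set<String> topics = ConcurrentHashMap.newKeySet();

        @Override
        public CompletableFuture<Set<String>> listTopics() {
            return CompletableFuture.completedFuture(Set.copyOf(topics));
        }

        @Override
        public CompletableFuture<Void> createTopic(String name) {
            topics.add(name);
            return CompletableFuture.completedFuture(null);
        }
    }

    /**
     * Blocks the calling thread on up to two admin calls.
     * Returns true if the DLQ topic had to be created.
     */
    static boolean ensureDlqTopic(TopicAdmin admin, String dlqTopic) {
        // Call 1: list topics. join() blocks until the result arrives.
        Set<String> existing = admin.listTopics().join();
        if (existing.contains(dlqTopic)) {
            return false;
        }
        // Call 2: create the topic. This blocks the caller again.
        admin.createTopic(dlqTopic).join();
        return true;
    }

    public static void main(String[] args) {
        TopicAdmin admin = new InMemoryTopicAdmin();
        System.out.println(ensureDlqTopic(admin, "my-dlq")); // prints "true"
        System.out.println(ensureDlqTopic(admin, "my-dlq")); // prints "false"
    }
}
```

With a real admin client, either blocking call can stall for as long as the client keeps retrying, and the thread that called ensureDlqTopic is stalled with it.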

      All of these steps occur synchronously on the herder's tick thread, in the portion where external requests are run. If the admin client call in the DLQ reporter setup step blocks for some time (because of auth failures and retries, network issues, or any other reason), the Connect worker can become non-functional (REST API requests will time out) and can even fall out of the Connect cluster and become a zombie, since the tick thread also drives group membership functions.
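
The failure mode can be reproduced in miniature with a single-threaded executor standing in for the herder's tick thread. This is an illustrative sketch only: the class and method names are hypothetical, the latch simulates a blocked admin client call, and none of it is taken from the Connect code base.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TickThreadSketch {

    /**
     * Queues a "request" behind a task restart on a single worker thread.
     * Returns true if the request fails to complete within timeoutMs.
     */
    static boolean requestTimesOut(boolean dlqSetupBlocks, long timeoutMs) {
        // One thread handles everything, like the herder's tick thread.
        ExecutorService tickThread = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Stand-in for the task restart whose DLQ setup may block.
            tickThread.submit(() -> {
                if (dlqSetupBlocks) {
                    try {
                        release.await(); // simulated stuck admin client call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            // Stand-in for a later REST request or a group membership action.
            Future<String> request = tickThread.submit(() -> "ok");
            try {
                request.get(timeoutMs, TimeUnit.MILLISECONDS);
                return false; // served in time
            } catch (TimeoutException e) {
                return true; // starved by the blocked task ahead of it
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        } finally {
            release.countDown();      // unblock the simulated call
            tickThread.shutdownNow(); // let the JVM exit
        }
    }

    public static void main(String[] args) {
        System.out.println(requestTimesOut(true, 200));   // prints "true"
        System.out.println(requestTimesOut(false, 5000)); // prints "false"
    }
}
```

The second task never gets a chance to run while the first one is blocked, which mirrors how REST requests and group membership work queue up behind a stalled DLQ setup.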

            People

              Assignee: Yash Mayya (yash.mayya)
              Reporter: Yash Mayya (yash.mayya)