Kafka / KAFKA-13335

Upgrading connect from 2.7.0 to 2.8.0 causes worker instability

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.8.0
    • Fix Version/s: None
    • Component/s: connect
    • Labels: None

    Description

      After recently upgrading our Connect cluster to 2.8.0 (via Strimzi on Kubernetes; the brokers are still on 2.7.0), I am noticing that the cluster is struggling to stabilize. Connectors are continuously being unassigned, reassigned, and duplicated, and the assignments never settle. Downgrading back to 2.7.0 fixes things immediately. I have attached a picture of our Grafana dashboards showing some metrics. The Connect cluster has 4 nodes and is trying to maintain about 1000 connectors, each configured for a maximum of 1 task.
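
      One way to watch the assignment churn described above is to poll the Connect REST API and tally tasks per worker. The following is a minimal sketch, assuming the JSON returned by GET /connectors?expand=status has already been parsed into a dict; the function name summarize_assignments and the sample worker IDs are illustrative, not part of Kafka Connect.

```python
from collections import Counter


def summarize_assignments(expanded_status):
    """Tally tasks per worker and list tasks that are not RUNNING.

    `expanded_status` is assumed to be the parsed JSON body of
    GET /connectors?expand=status: a dict keyed by connector name,
    where each value holds a "status" object with a "tasks" list.
    """
    per_worker = Counter()
    not_running = []
    for name, entry in expanded_status.items():
        status = entry.get("status", {})
        for task in status.get("tasks", []):
            per_worker[task.get("worker_id")] += 1
            if task.get("state") != "RUNNING":
                not_running.append((name, task.get("id"), task.get("state")))
    return per_worker, sorted(not_running)


# Hypothetical sample payload; connector names and worker IDs are made up.
sample = {
    "a": {"status": {"tasks": [{"id": 0, "state": "RUNNING", "worker_id": "w1:8083"}]}},
    "b": {"status": {"tasks": [{"id": 0, "state": "UNASSIGNED", "worker_id": "w2:8083"}]}},
    "c": {"status": {"tasks": []}},
}
per_worker, not_running = summarize_assignments(sample)
```

      Running this repeatedly against a live worker (the host and port depend on the deployment) and comparing the per-worker totals over time should show whether assignments ever settle.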

      We are noticing a slow increase in memory usage, along with large, irregular spikes in task counts and thread counts.

      Over the course of letting 2.8.0 run, I also notice a huge increase in logs of the form

      ERROR Graceful stop of task (task name here) failed.

      The logs do not seem to indicate a reason. The connector appears to be stopped only seconds after its creation, and this appears to affect only our source connectors. These logs stop after downgrading back to 2.7.0.

      I am also seeing an increase in logs like the following:

      Couldn't instantiate task (task name) because it has an invalid task configuration. This task will not execute until reconfigured. (org.apache.kafka.connect.runtime.distributed.DistributedHerder) [StartAndStopExecutor-connect-1-1]
      org.apache.kafka.connect.errors.ConnectException: Task already exists in this worker: (task name)
      	at org.apache.kafka.connect.runtime.Worker.startTask(Worker.java:512)
      	at org.apache.kafka.connect.runtime.distributed.DistributedHerder.startTask(DistributedHerder.java:1251)
      	at org.apache.kafka.connect.runtime.distributed.DistributedHerder.access$1700(DistributedHerder.java:127)
      	at org.apache.kafka.connect.runtime.distributed.DistributedHerder$10.call(DistributedHerder.java:1266)
      	at org.apache.kafka.connect.runtime.distributed.DistributedHerder$10.call(DistributedHerder.java:1262)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base/java.lang.Thread.run(Thread.java:834)
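
      To quantify the two log patterns quoted in this report, the messages can be counted per task. This is a minimal sketch, assuming the worker logs are available as plain text lines and that task names contain no whitespace; the function name count_task_errors and the regular expressions are illustrative and only match the message texts shown in this report.

```python
import re
from collections import Counter

# Patterns are based on the two messages quoted in this report.
GRACEFUL_STOP = re.compile(r"Graceful stop of task (\S+) failed")
ALREADY_EXISTS = re.compile(r"Task already exists in this worker: (\S+)")


def count_task_errors(lines):
    """Return (stop_failures, duplicates): per-task counts of each message."""
    stop_failures = Counter()
    duplicates = Counter()
    for line in lines:
        match = GRACEFUL_STOP.search(line)
        if match:
            stop_failures[match.group(1)] += 1
            continue
        match = ALREADY_EXISTS.search(line)
        if match:
            duplicates[match.group(1)] += 1
    return stop_failures, duplicates


# Hypothetical log lines; the task name "my-source-0" is made up.
example_lines = [
    "ERROR Graceful stop of task my-source-0 failed.",
    "org.apache.kafka.connect.errors.ConnectException: "
    "Task already exists in this worker: my-source-0",
    "INFO unrelated message",
    "ERROR Graceful stop of task my-source-0 failed.",
]
stop_failures, duplicates = count_task_errors(example_lines)
```

      Comparing these counts between a 2.7.0 run and a 2.8.0 run would show which tasks are affected and how often.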

      I am not sure what could be causing this; any insight would be appreciated!
      I do notice that Kafka 2.7.1/2.8.0 contains a bug fix related to Connect rebalances (KAFKA-10413). Could that fix be causing the instability?

          People

            Assignee: Unassigned
            Reporter: John Gray (gray.john)