Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-6013

Controller getting stuck

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.11.0.0, 0.11.0.1
    • Fix Version/s: None
    • Component/s: controller
    • Labels:

      Description

      It looks like a new issue in 0.11.0.0 and 0.11.0.1 still has it.

      We upgraded one of the clusters from 0.11.0.0 to 0.11.0.1 by shutting down 28 machines at once (single rack). When nodes came up none of them progressed after these log lines:

      Oct 05 02:17:42 mybroker14 kafka[32940]: INFO Kafka version : 0.11.0.1 (org.apache.kafka.common.utils.AppInfoParser)
      Oct 05 02:17:42 mybroker14 kafka[32940]: INFO Kafka commitId : c2a0d5f9b1f45bf5 (org.apache.kafka.common.utils.AppInfoParser)
      Oct 05 02:17:42 mybroker14 kafka[32940]: INFO [Kafka Server 10014], started (kafka.server.KafkaServer)
      

      There was no indication in controller node logs that it picked up rebooted nodes. This happened multiple times during the upgrade: once per rack plus some on top of that.

      Reboot took ~20m, all nodes in a single rack rebooted in parallel.

      The fix was to restart controller node, but that did not go cleanly too:

      ivan@mybroker26:~$ sudo journalctl --since 01:00 -u kafka | fgrep 'Error during controlled shutdown' -A1
      Oct 05 01:57:41 mybroker26 kafka[37409]: WARN [Kafka Server 10026], Error during controlled shutdown, possibly because leader movement took longer than the configured controller.socket.timeout.ms and/or request.timeout.ms: Connection to 10026 was disconnected before the response was read (kafka.server.KafkaServer)
      Oct 05 01:57:46 mybroker26 kafka[37409]: WARN [Kafka Server 10026], Retrying controlled shutdown after the previous attempt failed... (kafka.server.KafkaServer)
      --
      Oct 05 01:58:16 mybroker26 kafka[37409]: WARN [Kafka Server 10026], Error during controlled shutdown, possibly because leader movement took longer than the configured controller.socket.timeout.ms and/or request.timeout.ms: Connection to 10026 was disconnected before the response was read (kafka.server.KafkaServer)
      Oct 05 01:58:18 mybroker26 kafka[37409]: INFO Rolled new log segment for 'requests-40' in 3 ms. (kafka.log.Log)
      --
      Oct 05 01:58:51 mybroker26 kafka[37409]: WARN [Kafka Server 10026], Error during controlled shutdown, possibly because leader movement took longer than the configured controller.socket.timeout.ms and/or request.timeout.ms: Connection to 10026 was disconnected before the response was read (kafka.server.KafkaServer)
      Oct 05 01:58:56 mybroker26 kafka[37409]: WARN [Kafka Server 10026], Retrying controlled shutdown after the previous attempt failed... (kafka.server.KafkaServer)
      

      I'm unable to reproduce the issue by just restarting or even rebooting one broker, controller picks it up:

      Oct 05 03:18:18 mybroker83 kafka[37402]: INFO [Controller 10083]: Newly added brokers: 10001, deleted brokers: , all live brokers: ...
      

      KAFKA-5028 happened in 0.11.0.0, so it's likely related.

      cc Ismael Juma

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              bobrik Ivan Babrou
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated: