KAFKA-4477

Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.10.1.0
    • Fix Version/s: 0.10.1.1
    • Component/s: core
    • Environment: RHEL7

      java version "1.8.0_66"
      Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
      Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)

    Description

      We have encountered a critical issue that has recurred in different physical environments. We haven't worked out what is going on, but we do have a nasty workaround to keep the service alive.

      We have not had this issue on clusters still running 0.9.0.1.

      We have noticed a node randomly shrinking the ISRs for the partitions it leads down to just itself. Moments later we see other nodes reporting disconnects, followed finally by application issues, where producing to these partitions is blocked.

      It seems that only restarting the Kafka Java process resolves the issue.

      This has occurred multiple times, and according to all our network and machine monitoring the machine never left the network or experienced any other glitches.

      Below are logs seen during the issue.

      Node 7:
      [2016-12-01 07:01:28,112] INFO Partition [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 1,2,7 to 7 (kafka.cluster.Partition)

      All other nodes:
      [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 (kafka.server.ReplicaFetcherThread)
      java.io.IOException: Connection to 7 was disconnected before the response was read

      All clients:
      java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.

      After this occurs, we then see on the sick machine a rapidly growing number of sockets in CLOSE_WAIT and open file descriptors.
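
      For reference, a minimal way to observe this symptom is to count sockets in CLOSE_WAIT directly from /proc (state 0x08 in /proc/net/tcp and /proc/net/tcp6). The sketch below is illustrative only; any alert threshold or integration with monitoring is an assumption left to the operator.

      # Count TCP sockets in CLOSE_WAIT (state 0x08 in /proc/net/tcp*).
      CLOSE_WAIT = "08"

      def count_close_wait():
          total = 0
          for path in ("/proc/net/tcp", "/proc/net/tcp6"):
              try:
                  with open(path) as f:
                      next(f)  # skip the header line
                      for line in f:
                          if line.split()[3] == CLOSE_WAIT:  # column 4 is the socket state
                              total += 1
              except FileNotFoundError:
                  pass
          return total

      if __name__ == "__main__":
          print(count_close_wait(), "sockets in CLOSE_WAIT")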

      As a workaround to keep the service alive, we are putting in place an automated process that tails the server log and matches the regex below; when new_partitions contains only the node itself, we restart that node.

      "[(?P<time>.)] INFO Partition [.] on broker . Shrinking ISR for partition [.*] from (?P<old_partitions>.) to (?P<new_partitions>.+) (kafka.cluster.Partition)"

      Attachments

        1. 2016_12_15.zip
          1.44 MB
          Michael Andre Pearce
        2. 72_Server_Thread_Dump.txt
          62 kB
          Arpan
        3. 73_Server_Thread_Dump.txt
          62 kB
          Arpan
        4. 74_Server_Thread_Dump
          58 kB
          Arpan
        5. issue_node_1001_ext.log
          487 kB
          Tom DeVoe
        6. issue_node_1001.log
          555 kB
          Tom DeVoe
        7. issue_node_1002_ext.log
          193 kB
          Tom DeVoe
        8. issue_node_1002.log
          162 kB
          Tom DeVoe
        9. issue_node_1003_ext.log
          489 kB
          Tom DeVoe
        10. issue_node_1003.log
          507 kB
          Tom DeVoe
        11. kafka.jstack
          75 kB
          Michael Andre Pearce
        12. server_1_72server.log
          39 kB
          Arpan
        13. server_2_73_server.log
          41 kB
          Arpan
        14. server_3_74Server.log
          40 kB
          Arpan
        15. state_change_controller.tar.gz
          11 kB
          Tom DeVoe


          People

            Assignee: Apurva Mehta (apurva)
            Reporter: Michael Andre Pearce (michael.andre.pearce)
            Votes: 10
            Watchers: 35
