Kafka / KAFKA-4477

Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.10.1.0
    • Fix Version/s: 0.10.1.1
    • Component/s: core
    • Environment: RHEL7

      java version "1.8.0_66"
      Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
      Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)

    Description

      We have encountered a critical issue that has recurred in different physical environments. We haven't worked out what is going on. We do, though, have a nasty workaround to keep service alive.

      We have not had this issue on clusters still running 0.9.0.1.

      We have noticed a node randomly shrinking the ISRs for the partitions it owns down to just itself; moments later we see other nodes having disconnects, followed finally by application issues where producing to these partitions is blocked.

      It seems that only restarting the Kafka instance's Java process resolves the issue.

      We have had this occur multiple times, and according to all network and machine monitoring the machine never left the network or had any other glitches.

      Below are logs seen during the issue.

      Node 7:
      [2016-12-01 07:01:28,112] INFO Partition [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 1,2,7 to 7 (kafka.cluster.Partition)

      All other nodes:
      [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 (kafka.server.ReplicaFetcherThread)
      java.io.IOException: Connection to 7 was disconnected before the response was read

      All clients:
      java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.

      After this occurs, we then suddenly see an increasing number of CLOSE_WAIT sockets and open file descriptors on the sick machine.
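      For reference, one way to track that growth on the affected host is a small monitor like the sketch below; the broker pid, the ss-based socket count, and the alert threshold are assumptions for illustration, not part of the original report.

      #!/usr/bin/env python
      # Sketch: periodically count CLOSE_WAIT sockets and open file descriptors
      # of the broker process. Pid, paths and threshold are assumed values.
      import os
      import subprocess
      import time

      KAFKA_PID = 12345           # pid of the local broker process (hypothetical)
      CLOSE_WAIT_THRESHOLD = 500  # alert level (assumed)

      while True:
          # Count sockets stuck in CLOSE_WAIT using iproute2's `ss`.
          out = subprocess.run(["ss", "-tan", "state", "close-wait"],
                               capture_output=True, text=True).stdout
          close_waits = max(len(out.splitlines()) - 1, 0)  # drop the header line

          # Count open file descriptors of the broker via /proc (may need privileges).
          fds = len(os.listdir("/proc/%d/fd" % KAFKA_PID))

          print("close_wait=%d open_fds=%d" % (close_waits, fds))
          if close_waits > CLOSE_WAIT_THRESHOLD:
              print("WARNING: CLOSE_WAIT count exceeds threshold")
          time.sleep(30)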

      As a workaround to keep service alive, we are currently putting in an automated process that tails the broker log and matches the regex below; where new_partitions is just the node itself, we restart the node. A sketch of such a watchdog is shown after the regex.

      "[(?P<time>.)] INFO Partition [.] on broker . Shrinking ISR for partition [.*] from (?P<old_partitions>.) to (?P<new_partitions>.+) (kafka.cluster.Partition)"

      Attachments

        1. kafka.jstack
          75 kB
          Michael Andre Pearce
        2. issue_node_1002.log
          162 kB
          Tom DeVoe
        3. issue_node_1003.log
          507 kB
          Tom DeVoe
        4. issue_node_1001.log
          555 kB
          Tom DeVoe
        5. issue_node_1001_ext.log
          487 kB
          Tom DeVoe
        6. issue_node_1002_ext.log
          193 kB
          Tom DeVoe
        7. issue_node_1003_ext.log
          489 kB
          Tom DeVoe
        8. state_change_controller.tar.gz
          11 kB
          Tom DeVoe
        9. 2016_12_15.zip
          1.44 MB
          Michael Andre Pearce
        10. server_1_72server.log
          39 kB
          Arpan
        11. server_2_73_server.log
          41 kB
          Arpan
        12. server_3_74Server.log
          40 kB
          Arpan
        13. 72_Server_Thread_Dump.txt
          62 kB
          Arpan
        14. 73_Server_Thread_Dump.txt
          62 kB
          Arpan
        15. 74_Server_Thread_Dump
          58 kB
          Arpan


          People

            Assignee: Apurva Mehta
            Reporter: Michael Andre Pearce
            Votes: 10
            Watchers: 35

            Dates

              Created:
              Updated:
              Resolved: