Kafka / KAFKA-8654

Can't restart heartbeatThread after an unexpected exception in the heartbeat loop


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: consumer
    • Labels: None

    Description

      A consumer in our cluster has shown relatively high CPU usage for several days, caused by the Kafka poll thread. Digging in, we found that org.apache.kafka.clients.consumer.internals.AbstractCoordinator#timeToNextHeartbeat was returning zero, which led to a non-blocking select, which in turn made pollForFetches return immediately. Because the actual poll timeout is set to 1s, pollForFetches was called thousands of times per poll, i.e. per second.
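The effect described above can be illustrated with a small model. This is only a sketch under assumptions: HeartbeatTimerSketch and effectivePollTimeoutMs are hypothetical names, not Kafka classes, and the model assumes that the poll loop never waits longer than the time left until the next heartbeat. The timestamps are the heartbeatTimer values reported in this issue.

```java
// Hypothetical, simplified model of the behaviour described in this report.
// It is NOT the actual Kafka client code.
final class HeartbeatTimerSketch {
    private final long deadlineMs;

    HeartbeatTimerSketch(long deadlineMs) {
        this.deadlineMs = deadlineMs;
    }

    // Time left until the next heartbeat is due; clamped so it is never negative.
    long remainingMs(long nowMs) {
        return Math.max(0L, deadlineMs - nowMs);
    }

    // Assumption: the poll loop blocks for at most the time until the next heartbeat.
    static long effectivePollTimeoutMs(long requestedTimeoutMs, long timeToNextHeartbeatMs) {
        return Math.min(requestedTimeoutMs, timeToNextHeartbeatMs);
    }

    public static void main(String[] args) {
        // deadlineMs and currentTimeMs as reported in the timer dump of this issue.
        HeartbeatTimerSketch timer = new HeartbeatTimerSketch(1562075793801L);
        long nowMs = 1562823681506L;

        // The timer expired long ago, so the 1s poll timeout collapses to 0 ms.
        long timeoutMs = effectivePollTimeoutMs(1000L, timer.remainingMs(nowMs));
        System.out.println("effective poll timeout = " + timeoutMs + " ms");
    }
}
```

With a timer that is never reset, every iteration gets a zero timeout, which would match the busy loop seen in the poll thread.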

      We used a tool to inspect in-memory variables, which showed the following attributes of the corresponding heartbeatTimer:

      @Timer[
      time=@SystemTime[org.apache.kafka.common.utils.SystemTime@4d806627],
      startMs=@Long[1562075783801], // Jul 02 2019 13:56:23
      currentTimeMs=@Long[1562823681506], // Thu Jul 11 2019 05:41:21
      deadlineMs=@Long[1562075793801], // Tue Jul 02 2019 13:56:33
      ]

      This shows that no heartbeat has been sent for almost nine days, since 07-02 13:56, which is when we restarted the brokers. jstack also shows that the corresponding heartbeatThread is dead. Unfortunately we don't keep logs for that long, so I can't figure out what happened at that time.
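The arithmetic behind the timer dump can be checked with a few lines of java.time code. This is a sketch for illustration only; TimerDumpArithmetic and its method names are hypothetical, and the three constants are copied from the dump above.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper that only re-does the arithmetic on the reported values.
final class TimerDumpArithmetic {
    static final long START_MS = 1562075783801L;
    static final long CURRENT_MS = 1562823681506L;
    static final long DEADLINE_MS = 1562075793801L;

    // Length of the timer window that was armed at startMs.
    static Duration timerWindow() {
        return Duration.ofMillis(DEADLINE_MS - START_MS);
    }

    // How long the timer had already been expired when the dump was taken.
    static Duration overdue() {
        return Duration.ofMillis(CURRENT_MS - DEADLINE_MS);
    }

    public static void main(String[] args) {
        System.out.println("start    = " + Instant.ofEpochMilli(START_MS));
        System.out.println("deadline = " + Instant.ofEpochMilli(DEADLINE_MS));
        System.out.println("current  = " + Instant.ofEpochMilli(CURRENT_MS));
        System.out.println("window   = " + timerWindow().toMillis() + " ms");
        Duration late = overdue();
        System.out.println("overdue  = " + late.toDays() + " days "
                + (late.toHours() % 24) + " hours");
    }
}
```

The window comes out at 10000 ms, and the deadline had been overdue for roughly 8 days and 15 hours when the dump was taken.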

      IMO heartbeatThread is too important to be left dead; there should be at least some way to revive it. But it seems that startHeartbeatThreadIfNeeded can only be triggered by a restart or by the heartbeat itself.
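One possible shape for such a revival mechanism is sketched below. This is not how the Kafka client works and not a proposed patch; RevivableWorker, ensureRunning and demo are hypothetical names, and the sketch only shows the general idea of checking liveness from the calling thread, restarting a dead worker, and remembering why it died.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical supervisor: restarts a background thread if it has died.
final class RevivableWorker {
    private final Supplier<Thread> factory;
    private Thread current;
    private volatile Throwable lastFailure;

    RevivableWorker(Supplier<Thread> factory) {
        this.factory = factory;
    }

    // Starts the worker on first use and replaces it if the previous one is dead.
    // Returns true when a new thread was started.
    synchronized boolean ensureRunning() {
        if (current != null && current.isAlive()) {
            return false;
        }
        current = factory.get();
        current.setDaemon(true);
        // Remember the cause of death so it can be reported later.
        current.setUncaughtExceptionHandler((thread, error) -> lastFailure = error);
        current.start();
        return true;
    }

    synchronized Thread current() {
        return current;
    }

    Throwable lastFailure() {
        return lastFailure;
    }

    private static void joinQuietly(Thread thread) {
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Demo: the worker dies immediately each time; returns how often it was started.
    static int demo() {
        AtomicInteger starts = new AtomicInteger();
        RevivableWorker worker = new RevivableWorker(() -> new Thread(() -> {
            starts.incrementAndGet();
            throw new IllegalStateException("simulated unexpected failure");
        }, "sketch-heartbeat"));

        worker.ensureRunning();        // first start
        joinQuietly(worker.current()); // wait until the worker has died
        worker.ensureRunning();        // dead thread detected, started again
        joinQuietly(worker.current());
        return starts.get();
    }

    public static void main(String[] args) {
        System.out.println("worker started " + demo() + " times");
    }
}
```

In a client, a check like ensureRunning() could be made on every poll, so that a dead background thread is noticed on the next call instead of staying dead for days.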

      It's also confusing that almost everything in org.apache.kafka.clients.consumer.internals.AbstractCoordinator.HeartbeatThread#run is async, so it seems impossible for any exception to happen there. Why, then, are there so many catch clauses?

       


          People

            Assignee: Unassigned
            Reporter: nick allen (nicktheuncharted)