  1. Kafka
  2. KAFKA-5973

ShutdownableThread catching errors can lead to partial hard to diagnose broker failure

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.11.0.0, 0.11.0.1
    • Fix Version/s: None
    • Component/s: core
    • Labels: None

    Description

      When any of the Kafka broker's ShutdownableThread subclasses crashes due to an
      uncaught exception, the broker is left running in a degraded state: some
      threads are no longer running, yet the broker may still be serving traffic to
      users without performing its usual operations.

      This is problematic, because monitoring may say that "the broker is up and fine", but in fact it is not healthy.

      At Heroku we've been mitigating this by monitoring all threads that "should" be
      running on a broker and alerting when a given thread isn't running for some
      reason.
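
      A minimal sketch of that kind of check, done inside the JVM for brevity. The class name and the idea of matching on thread-name prefixes are assumptions for illustration; the prefixes themselves would have to be taken from the broker version being monitored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper: reports which expected thread-name prefixes have no live thread.
final class ThreadLivenessCheck {
    private ThreadLivenessCheck() {}

    /** Returns every expected prefix for which no live thread name starts with it. */
    static List<String> missingThreads(List<String> expectedPrefixes) {
        // Snapshot of all live threads in this JVM, including the caller.
        Set<Thread> live = Thread.getAllStackTraces().keySet();
        List<String> missing = new ArrayList<>();
        for (String prefix : expectedPrefixes) {
            boolean found = false;
            for (Thread t : live) {
                if (t.isAlive() && t.getName().startsWith(prefix)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                missing.add(prefix);
            }
        }
        return missing;
    }
}
```

      A non-empty result is what would trigger an alert. The weakness is the one described in this issue: the list of expected prefixes has to be maintained by hand on every release.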

      Things that use ShutdownableThread that can crash and leave a broker/the controller in a bad state:

      • log cleaner
      • replica fetcher threads
      • controller to broker send threads
      • controller topic deletion threads
      • quota throttling reapers
      • io threads
      • network threads
      • group metadata management threads

      Some of these can have disastrous consequences, and a crash in nearly any of them, for any reason, is cause for an alert.
      But, users probably shouldn't have to know about all the internals of Kafka and run thread dumps periodically as part of normal operations.

      There are a few potential options here:

      1. On the crash of any ShutdownableThread, shut down the whole broker process

      We could crash the whole broker when an individual thread dies. I think this is pretty reasonable: it's better to have a very visible breakage than a very hard to detect one.
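
      A rough sketch of what option 1 could look like, not a patch against Kafka. Going by the issue title, the thread catches the error itself, so the fail-fast step would have to live in that catch path rather than in a JVM-wide uncaught-exception handler. FailFastRunnable and its injected exit action are hypothetical names used only for this illustration.

```java
import java.util.function.IntConsumer;

// Hypothetical wrapper: any Throwable escaping the wrapped body triggers the exit action.
final class FailFastRunnable implements Runnable {
    private final Runnable body;
    private final IntConsumer exit; // injected so the behaviour can be tested without exiting

    FailFastRunnable(Runnable body, IntConsumer exit) {
        this.body = body;
        this.exit = exit;
    }

    /** Wraps the body so that a crash halts the whole JVM with status 1. */
    static FailFastRunnable haltingOnCrash(Runnable body) {
        return new FailFastRunnable(body, Runtime.getRuntime()::halt);
    }

    @Override
    public void run() {
        try {
            body.run();
        } catch (Throwable e) {
            // Log first, then take the process down so the failure is visible.
            System.err.println("Fatal: thread " + Thread.currentThread().getName() + " died: " + e);
            exit.accept(1);
        }
    }
}
```

      Runtime.halt is used in the halting variant so that shutdown hooks cannot block the exit. Whether a broker should instead attempt an orderly shutdown is a separate design question that this sketch does not settle.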

      2. Add some healthcheck JMX bean to detect these thread crashes

      Users having to audit all of Kafka's source code on each new release and track a list of "threads that should be running" is... pretty silly. We could instead expose a JMX bean of some kind indicating threads that died due to uncaught exceptions.
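
      One possible shape for such a bean, as a sketch under stated assumptions. The names DeadThreads, RegistryMXBean, recordDeath, and the attribute names are all hypothetical and do not correspond to any existing Kafka metric; the object name passed to register is likewise the caller's choice.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import javax.management.JMException;
import javax.management.ObjectName;

// Hypothetical holder for a health-check MXBean listing threads that died.
final class DeadThreads {
    private DeadThreads() {}

    /** Management interface; JMX requires it to be public. */
    public interface RegistryMXBean {
        int getDeadThreadCount();
        List<String> getDeadThreadNames();
    }

    static final class Registry implements RegistryMXBean {
        private final List<String> dead = new CopyOnWriteArrayList<>();

        /** Called from the code path that observes a thread's fatal exception. */
        void recordDeath(Thread t, Throwable cause) {
            dead.add(t.getName() + ": " + cause);
        }

        @Override
        public int getDeadThreadCount() {
            return dead.size();
        }

        @Override
        public List<String> getDeadThreadNames() {
            return new ArrayList<>(dead); // defensive copy for JMX clients
        }
    }

    /** Registers a fresh registry with the platform MBean server under the given name. */
    static Registry register(String objectName) throws JMException {
        Registry registry = new Registry();
        ManagementFactory.getPlatformMBeanServer()
                .registerMBean(registry, new ObjectName(objectName));
        return registry;
    }
}
```

      With something like this in place, monitoring would only need to alert when DeadThreadCount is greater than zero, instead of tracking the expected thread list per release.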

      3. Do nothing, but add documentation around monitoring/logging that exposes this error

      These thread deaths do emit log lines, but it's not clear or obvious to users that they need to monitor and alert on them. The project could add documentation to that effect.

      Attachments

        1. 5973.v1.txt
          0.9 kB
          Ted Yu

            People

              Assignee: Unassigned
              Reporter: Tom Crayford (tcrayford-heroku)
              Votes: 1
              Watchers: 11

              Dates

                Created:
                Updated: