Kafka / KAFKA-6199

Single broker with fast growing heap usage

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.2.1
    • Fix Version/s: 1.1.0
    • Component/s: None
    • Labels: None
    • Environment: Amazon Linux

    Description

      One broker in our 25-broker cluster has fast-growing heap usage, which forces us to restart it every 12 hours. If we don't restart it, long GC pauses make it very slow and it eventually throws OutOfMemory errors.

      See Screen Shot 2017-11-10 at 11.59.06 AM.png for a graph of heap usage percentage on the broker. A "normal" broker in the same cluster stays below 50% (averaged) over the same time period.

      We took heap dumps while the broker's heap usage was getting dangerously high, and they show a lot of retained NetworkSend objects referencing byte buffers.
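
      One way to confirm which classes are growing is to compare two class histograms taken some hours apart on the affected broker. The sketch below assumes the histograms are plain `jmap -histo:live` text output (rank, instance count, bytes, class name per row); the helper names `parse_histo` and `byte_growth` are hypothetical and not part of any Kafka or JDK tooling. Note that the byte column in a histogram is shallow size, not retained size, so this complements rather than replaces the heap dump analysis.

```python
import re
from typing import Dict, List, Tuple

# Matches data rows of `jmap -histo` output, for example:
#    1:       1234567      123456789  [B
# Header, separator, and "Total" lines do not match and are skipped.
_ROW = re.compile(r"^\s*\d+:\s+(\d+)\s+(\d+)\s+(\S+)")


def parse_histo(text: str) -> Dict[str, Tuple[int, int]]:
    """Map class name -> (instance count, shallow bytes) for one histogram."""
    rows: Dict[str, Tuple[int, int]] = {}
    for line in text.splitlines():
        m = _ROW.match(line)
        if m:
            rows[m.group(3)] = (int(m.group(1)), int(m.group(2)))
    return rows


def byte_growth(
    before: Dict[str, Tuple[int, int]],
    after: Dict[str, Tuple[int, int]],
    top: int = 10,
) -> List[Tuple[str, int]]:
    """Return the `top` classes whose shallow bytes grew most, largest first."""
    deltas = []
    for cls, (_, size_after) in after.items():
        # Classes absent from the earlier histogram count as starting at 0 bytes.
        size_before = before.get(cls, (0, 0))[1]
        deltas.append((cls, size_after - size_before))
    deltas.sort(key=lambda item: item[1], reverse=True)
    return deltas[:top]
```

      Reading two of the attached histogram files into strings, parsing each with `parse_histo`, and passing the results to `byte_growth` would list the classes with the largest byte increase between the two snapshots.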

      We also noticed that the affected broker logs this warning far more often than any other broker:

      WARN Attempting to send response via channel for which there is no open connection, connection id 13 (kafka.network.Processor)
      

      See Screen Shot 2017-11-10 at 1.55.33 PM.png for counts of that WARN log message visualized across all the brokers (to show it happens a bit on other brokers, but not nearly as much as it does on the "bad" broker).
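
      To break these counts down further, the warning can be tallied per connection id from a broker's log. This is a minimal sketch that assumes the message text matches the WARN line quoted above and that the connection id is a single whitespace-free token; the function name `count_closed_channel_warnings` is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Captures the token that follows "connection id" in the warning text.
_WARN = re.compile(
    r"Attempting to send response via channel for which there is no open "
    r"connection, connection id (\S+)"
)


def count_closed_channel_warnings(lines: Iterable[str]) -> Counter:
    """Count occurrences of the warning, keyed by connection id."""
    counts: Counter = Counter()
    for line in lines:
        m = _WARN.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

      Passing an open log file to the function and calling `most_common()` on the result would show which connection ids account for most of the warnings, which may help when narrowing the problem down to a particular client.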

      I can't make the heap dumps public, but would appreciate advice on how to pin down the problem better. We're currently trying to narrow it down to a particular client, but without much success so far.

      Let me know what else I could investigate or share to track down the source of this leak.

      Attachments

        1. jstack-2017-12-08.scrubbed.out
          221 kB
          Robin Tweedie
        2. histo_live_20171206.txt
          120 kB
          Robin Tweedie
        3. histo_live_80.txt
          121 kB
          Robin Tweedie
        4. histo_live.txt
          135 kB
          Robin Tweedie
        5. dominator_tree.png
          948 kB
          Robin Tweedie
        6. path2gc.png
          1.01 MB
          Robin Tweedie
        7. merge_shortest_paths.png
          499 kB
          Robin Tweedie
        8. Screen Shot 2017-11-10 at 11.59.06 AM.png
          218 kB
          Robin Tweedie
        9. Screen Shot 2017-11-10 at 1.55.33 PM.png
          65 kB
          Robin Tweedie

            People

              Assignee: Unassigned
              Reporter: Robin Tweedie (rt_skyscanner)
              Votes: 0
              Watchers: 8
