memory leakage in org.apache.kafka.common.network.Selector

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.0, 1.1.1
    • Fix Version/s: 2.0.1
    • Component/s: core
    • Labels: None

    Description

      We are testing secure writes to Kafka over SSL. At small scale, SSL writes to Kafka worked fine. However, when we enabled SSL writes at a larger scale (more than 40k clients writing concurrently), the Kafka brokers soon hit OutOfMemory errors with a 4 GB heap setting. We tried increasing the heap size to 10 GB but encountered the same issue.

      We took a few heap dumps and found that most of the heap memory is referenced through org.apache.kafka.common.network.Selector objects. Selector has two channel map fields, channels and closingChannels. It seems that objects are not removed from these maps in a timely manner.

      One observation is that the memory leak seems to be related to Kafka partition leader changes. If a broker restart or a similar event in the cluster causes partition leadership changes, the brokers may hit the OOM issue faster.

          private final Map<String, KafkaChannel> channels;
          private final Map<String, KafkaChannel> closingChannels;
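
      To make the suspected pattern concrete, here is a minimal sketch of how a pair of maps like these can retain memory. It is a hypothetical illustration, not Kafka's Selector code: the class ChannelRegistrySketch, the Channel stand-in, and the methods register, close, and finishClose are all invented for this example. It assumes that a closed connection with pending data is parked in a second map and only released by a later cleanup step.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is NOT Kafka's Selector implementation.
public class ChannelRegistrySketch {
    // Stand-in for a channel that owns per-connection buffers.
    static final class Channel {
        final byte[] buffer;
        Channel(int bufferBytes) { this.buffer = new byte[bufferBytes]; }
    }

    private final Map<String, Channel> channels = new HashMap<>();
    private final Map<String, Channel> closingChannels = new HashMap<>();

    void register(String id, int bufferBytes) {
        channels.put(id, new Channel(bufferBytes));
    }

    // Close a connection. If it still has pending data, park it in
    // closingChannels instead of dropping it (assumed behavior for this sketch).
    void close(String id, boolean hasPendingData) {
        Channel ch = channels.remove(id);
        if (ch != null && hasPendingData) {
            closingChannels.put(id, ch);
        }
    }

    // Cleanup step. If this is never reached, the parked channel and its
    // buffer stay strongly reachable and cannot be garbage collected.
    void finishClose(String id) {
        closingChannels.remove(id);
    }

    int openCount() { return channels.size(); }
    int closingCount() { return closingChannels.size(); }

    public static void main(String[] args) {
        ChannelRegistrySketch registry = new ChannelRegistrySketch();
        for (int i = 0; i < 1000; i++) {
            String id = "conn-" + i;
            registry.register(id, 1024);
            registry.close(id, true); // finishClose is never called
        }
        // Every closed connection is still held by the second map.
        System.out.println("open=" + registry.openCount()
                + " closing=" + registry.closingCount());
    }
}
```

      In this sketch the open map is empty after the loop while the closing map still holds all 1000 entries, so their buffers remain on the heap. Whether the broker behaves this way is exactly what the heap dumps would need to confirm.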
      

      Please see the attached images and the following link for a sample GC analysis.

      http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDgvMTcvLS1nYy5sb2cuMC5jdXJyZW50Lmd6LS0yLTM5LTM0

      The command line for running Kafka:

      java -Xms10g -Xmx10g -XX:NewSize=512m -XX:MaxNewSize=512m -Xbootclasspath/p:/usr/local/libs/bcp -XX:MetaspaceSize=128m -XX:+UseG1GC -XX:MaxGCPauseMillis=25 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=25 -XX:MaxMetaspaceFreeRatio=75 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/var/log/kafka/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=40 -XX:GCLogFileSize=50M -Djava.awt.headless=true -Dlog4j.configuration=file:/etc/kafka/log4j.properties -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.rmi.port=9999 -cp /usr/local/libs/*  kafka.Kafka /etc/kafka/server.properties
      

      We use Java 1.8.0_102 and have applied a TLS patch that reduces the X509Factory.certCache map size from 750 to 20.

      java -version
      java version "1.8.0_102"
      Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
      Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
      

      Attachments

        1. Screen Shot 2018-08-16 at 11.06.38 PM.png
          751 kB
          Yu Yang
        2. Screen Shot 2018-08-16 at 11.04.16 PM.png
          108 kB
          Yu Yang
        3. Screen Shot 2018-08-16 at 4.26.19 PM.png
          44 kB
          Yu Yang
        4. Screen Shot 2018-08-16 at 12.41.26 PM.png
          327 kB
          Yu Yang
        5. Screen Shot 2018-08-17 at 1.03.35 AM.png
          838 kB
          Yu Yang
        6. Screen Shot 2018-08-17 at 1.04.32 AM.png
          638 kB
          Yu Yang
        7. Screen Shot 2018-08-17 at 1.05.30 AM.png
          427 kB
          Yu Yang
        8. 7304.v4.txt
          2 kB
          Ted Yu
        9. 7304.v7.txt
          4 kB
          Ted Yu
        10. Screen Shot 2018-08-28 at 11.09.45 AM.png
          83 kB
          Yu Yang
        11. Screen Shot 2018-08-29 at 10.50.47 AM.png
          251 kB
          Yu Yang
        12. Screen Shot 2018-08-29 at 10.49.03 AM.png
          664 kB
          Yu Yang
        13. Screen Shot 2018-09-29 at 8.34.50 PM.png
          1.26 MB
          Yu Yang
        14. Screen Shot 2018-09-29 at 10.38.12 PM.png
          131 kB
          Yu Yang
        15. Screen Shot 2018-09-29 at 10.38.38 PM.png
          1.49 MB
          Yu Yang

            People

              Assignee: rsivaram Rajini Sivaram
              Reporter: yuyang08 Yu Yang
              Votes: 0
              Watchers: 14
