KAFKA-1024

possible memory leak in 0.8 beta1 producer with G1GC

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not a Problem
    • Affects Version/s: 0.8.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment:
      java version "1.7.0_17"
      Java(TM) SE Runtime Environment (build 1.7.0_17-b02)
      Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)

      Description

      I have this in my pom.xml:

      <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka_2.9.2</artifactId>
        <version>0.8.0-beta1</version>
      </dependency>
      <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.9.2</version>
      </dependency>
      <dependency>
        <groupId>org.xerial.snappy</groupId>
        <artifactId>snappy-java</artifactId>
        <version>1.0.4.1</version>
      </dependency>

      I'm using the snappy compression codec and producing about 7k msg/sec on average, in batches of up to 10 messages, with null keys, to a topic with 16 partitions and no replication. Xms and Xmx are both 64MB and I'm using -XX:+UseG1GC. After about 12 hours of operation, heap usage hits right up against the 64MB limit and the producer drops to about 4k msg/sec because of GC pressure. When I restart the process, heap usage goes back to normal (around 30MB) and the producer does 7k msg/sec again.
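
      For reference, a minimal sketch of a producer configured this way with the 0.8 Java API (the broker address, topic name, and payload below are placeholders, not my actual values):

      import java.util.ArrayList;
      import java.util.List;
      import java.util.Properties;

      import kafka.javaapi.producer.Producer;
      import kafka.producer.KeyedMessage;
      import kafka.producer.ProducerConfig;

      public class ProducerSketch {
          public static void main(String[] args) {
              Properties props = new Properties();
              props.put("metadata.broker.list", "broker1:9092");                 // placeholder
              props.put("serializer.class", "kafka.serializer.DefaultEncoder");  // raw byte[] payloads
              props.put("compression.codec", "snappy");

              Producer<byte[], byte[]> producer =
                  new Producer<byte[], byte[]>(new ProducerConfig(props));

              // A batch of up to 10 ~2KB messages with null keys, letting the
              // producer choose the partition among the topic's 16 partitions.
              List<KeyedMessage<byte[], byte[]>> batch =
                  new ArrayList<KeyedMessage<byte[], byte[]>>();
              for (int i = 0; i < 10; i++) {
                  batch.add(new KeyedMessage<byte[], byte[]>("my-topic", null, new byte[2048]));
              }
              producer.send(batch);
              producer.close();
          }
      }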

      What else can I provide to help debug this?

        Activity

        Jay Kreps added a comment -

        64MB is a really small heap. Can you validate that this is a leak and not just "not enough memory"? To do so, set the heap to something like 1GB and check that it still runs out of memory.

        If you want to tune for less memory, reduce the queue length on the producer.

        Jason Toffaletti added a comment -

        I did try 128MB and saw the same problem; I'll try reducing the queue length. 64MB came from the RabbitMQ producer I'm attempting to replace, so a small heap should work in theory.

        Jason Toffaletti added a comment -

        To confirm, I'm using this reference:

        http://kafka.apache.org/08/configuration.html

        The setting I want to change is queue.buffering.max.messages?

        Jay Kreps added a comment -

        Yes.

        Jason Toffaletti added a comment -

        The process has been up for 19 hours now without running out of memory at Xmx 128MB and queue.buffering.max.messages 1000, so I think your diagnosis of "not enough memory" due to the queue length was correct. My average message size is 2KB, so even at 10,000 messages I didn't expect it to use more than 20MB. I'd like to understand what side effects reducing the queue length has and why 10000 is the default. Does it mean that if the queue is full and queue.buffering.max.ms is reached while waiting to insert into the queue, I'll see dropped messages?

        Jay Kreps added a comment -

        Your serialized message size may be 2KB, but I wonder if that is true of the Java objects themselves. If these are compound objects with lots of little sub-objects, that can add a ton of overhead.

        Yes. When you reach the max queue length you can either block the sender or drop the message. That timeout controls how long to wait before data is dropped (-1 means never drop).

        One downside of counting the buffer in terms of messages is that most people don't really have a good idea of their message sizes. So it is hard to say if 10k is too big or too small. In a future version we may improve the client to buffer messages already serialized so we can give the bound in terms of memory space, which I think would be more intuitive.
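
        For concreteness, a sketch of how these queue settings fit together in the producer properties (values are illustrative; in 0.8 the block-versus-drop behavior is presumably controlled by queue.enqueue.timeout.ms):

        import java.util.Properties;
        import kafka.producer.ProducerConfig;

        public class QueueTuningSketch {
            static ProducerConfig tunedConfig() {
                Properties props = new Properties();
                props.put("metadata.broker.list", "broker1:9092");  // placeholder
                props.put("producer.type", "async");                // queueing only applies to the async producer
                props.put("queue.buffering.max.messages", "1000");  // default is 10000
                props.put("queue.buffering.max.ms", "5000");        // async flush interval
                // -1 blocks the sender when the queue is full, 0 drops the message
                // immediately, and a positive value waits that many milliseconds
                // before giving up.
                props.put("queue.enqueue.timeout.ms", "-1");
                return new ProducerConfig(props);
            }
        }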

        Jason Toffaletti added a comment -

        I may have spoken too soon; it might just be taking more time to reach critical levels. The GC activity reported by jvisualvm has slowly crept up over the past hour, from rarely peaking at 0.1% to steadily hovering around 1.3%, and the heap size has climbed from 50MB to over 80MB.

        Object Histogram:
        
        num 	  #instances	#bytes	Class description
        --------------------------------------------------------------------------
        1:		488074	26414648	char[]
        2:		15191	20018704	byte[]
        3:		486709	11681016	java.lang.String
        4:		63839	8693080	* MethodKlass
        5:		63839	8252816	* ConstMethodKlass
        6:		93927	7522528	java.util.HashMap$Entry[]
        7:		5468	5868952	* ConstantPoolKlass
        8:		106269	5714200	java.lang.Object[]
        9:		91418	5119408	java.util.HashMap
        10:		5468	4572616	* InstanceKlassKlass
        11:		4538	3346656	* ConstantPoolCacheKlass
        12:		79237	3169480	java.util.TreeMap$Entry
        13:		45404	2912840	java.util.Hashtable$Entry[]
        14:		90667	2901344	java.util.Vector
        15:		45352	2539712	java.util.Hashtable
        16:		45330	1813200	java.security.ProtectionDomain
        17:		14911	1650304	int[]
        ...
        Total : 	2327288	140772592
        Heap traversal took 46.218 seconds.
        

        I'll update this in a few hours to get an idea of where the difference is. My messages aren't complex: I'm using DefaultEncoder with KeyedMessage<byte[], byte[]> and null keys, so I thought I had a pretty good idea of what size the messages are.

        I'm running this test on dev machines that are used for many other purposes, so it is very likely ZooKeeper will be slow and I'll end up hitting the max queue depth during a metadata refresh. However, I haven't seen dropped messages on any producers yet.

        Jason Toffaletti added a comment -
        num 	  #instances	#bytes	Class description
        --------------------------------------------------------------------------
        1:		548504	29644120	char[]
        2:		14663	14153936	byte[]
        3:		547282	13134768	java.lang.String
        4:		114927	9202528	java.util.HashMap$Entry[]
        5:		64379	8766520	* MethodKlass
        6:		64379	8314168	* ConstMethodKlass
        7:		127311	6885688	java.lang.Object[]
        8:		112352	6291712	java.util.HashMap
        9:		5579	5986248	* ConstantPoolKlass
        10:		5579	4657080	* InstanceKlassKlass
        11:		55875	3582984	java.util.Hashtable$Entry[]
        12:		111601	3571232	java.util.Vector
        13:		4652	3436992	* ConstantPoolCacheKlass
        14:		82004	3280160	java.util.TreeMap$Entry
        15:		55819	3125864	java.util.Hashtable
        16:		55797	2231880	java.security.ProtectionDomain
        17:		16616	2048808	int[]
        ...
        Total : 	2632815	148775120
        Heap traversal took 51.673 seconds.
        

        The growth is slow but still there. I'm going to try a different garbage collector to rule that out. This could just be some odd interaction between G1GC and Scala creating lots of short-lived objects. As I understand it, G1GC tries to keep collections within a fixed pause-time target, so it might just be getting overwhelmed with garbage. I also haven't forgotten your advice to try a bigger heap.
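
        For reference, the comparison comes down to the collector flag passed to the producer's JVM, roughly like this (heap size per the later tests; the jar and main class are placeholders):

        java -Xmx128m -XX:+UseG1GC -cp producer.jar com.example.ProducerApp
        java -Xmx128m -XX:+UseParallelGC -cp producer.jar com.example.ProducerApp
        # G1's pause-time target, if relevant, is tuned with -XX:MaxGCPauseMillis (default 200ms).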

        Jason Toffaletti added a comment -

        Using the default Parallel GC, memory usage hasn't grown in 4 hours and the CPU usage reported by top is down from 240% to 140%.

        Some selections from the jmap histogram show a dramatic difference.

        6:		193630	4647120	java.lang.String
        8:		11365	3939472	byte[]
        18:		13673	476928	java.lang.Object[]
        23:		3558	293600	java.util.HashMap$Entry[]
        38:		1092	61152	java.util.HashMap
        62:		343	10976	java.util.Vector
        

        All signs point to G1GC being the problem.

        Jay Kreps added a comment -

        Interesting....


          People

          • Assignee: Unassigned
          • Reporter: Jason Toffaletti