KAFKA-6185: Selector memory leak with high likelihood of OOM in case of down conversion

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.1, 1.1.0
    • Component/s: core
    • Environment: Ubuntu 14.04.5 LTS
      5 brokers: 1 & 2 on 1.0.0; 3, 4, 5 on 0.11.0.1
      inter.broker.protocol.version=0.11.0.1
      log.message.format.version=0.11.0.1
      clients are a mix of 0.9, 0.10, and 0.11

    Description

      We are testing 1.0.0 in a couple of environments.
      Both have about 5 brokers, with two 1.0.0 brokers and the rest on 0.11.0.1.
      One is using on-disk message format 0.9.0.1, the other 0.11.0.1.
      We have 0.9, 0.10, and 0.11 clients connecting.

      The cluster on the 0.9.0.1 format has been running fine for a week.

      But the cluster on the 0.11.0.1 format is consistently having memory issues, and only on the two upgraded brokers running 1.0.0.

      The first occurrence of the error comes with this stack trace:

      {"timestamp":"2017-11-06 14:22:32,402","level":"ERROR","logger":"kafka.server.KafkaApis","thread":"kafka-request-handler-7","message":"[KafkaApi-1] Error when handling request {replica_id=-1,max_wait_time=500,min_bytes=1,topics=[{topic=maxwell.users,partitions=[{partition=0,fetch_offset=227537,max_bytes=11000000},{partition=4,fetch_offset=354468,max_bytes=11000000},{partition=5,fetch_offset=266524,max_bytes=11000000},{partition=8,fetch_offset=324562,max_bytes=11000000},{partition=10,fetch_offset=292931,max_bytes=11000000},{partition=12,fetch_offset=325718,max_bytes=11000000},{partition=15,fetch_offset=229036,max_bytes=11000000}]}]}"}
      java.lang.OutOfMemoryError: Java heap space
              at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
              at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
              at org.apache.kafka.common.record.AbstractRecords.downConvert(AbstractRecords.java:101)
              at org.apache.kafka.common.record.FileRecords.downConvert(FileRecords.java:253)
              at kafka.server.KafkaApis$$anonfun$kafka$server$KafkaApis$$convertedPartitionData$1$1$$anonfun$apply$4.apply(KafkaApis.scala:520)
              at kafka.server.KafkaApis$$anonfun$kafka$server$KafkaApis$$convertedPartitionData$1$1$$anonfun$apply$4.apply(KafkaApis.scala:518)
              at scala.Option.map(Option.scala:146)
              at kafka.server.KafkaApis$$anonfun$kafka$server$KafkaApis$$convertedPartitionData$1$1.apply(KafkaApis.scala:518)
              at kafka.server.KafkaApis$$anonfun$kafka$server$KafkaApis$$convertedPartitionData$1$1.apply(KafkaApis.scala:508)
              at scala.Option.flatMap(Option.scala:171)
              at kafka.server.KafkaApis.kafka$server$KafkaApis$$convertedPartitionData$1(KafkaApis.scala:508)
              at kafka.server.KafkaApis$$anonfun$kafka$server$KafkaApis$$createResponse$2$1.apply(KafkaApis.scala:556)
              at kafka.server.KafkaApis$$anonfun$kafka$server$KafkaApis$$createResponse$2$1.apply(KafkaApis.scala:555)
              at scala.collection.Iterator$class.foreach(Iterator.scala:891)
              at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
              at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
              at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
              at kafka.server.KafkaApis.kafka$server$KafkaApis$$createResponse$2(KafkaApis.scala:555)
              at kafka.server.KafkaApis$$anonfun$kafka$server$KafkaApis$$fetchResponseCallback$1$1.apply(KafkaApis.scala:569)
              at kafka.server.KafkaApis$$anonfun$kafka$server$KafkaApis$$fetchResponseCallback$1$1.apply(KafkaApis.scala:569)
              at kafka.server.KafkaApis$$anonfun$sendResponseMaybeThrottle$1.apply$mcVI$sp(KafkaApis.scala:2034)
              at kafka.server.ClientRequestQuotaManager.maybeRecordAndThrottle(ClientRequestQuotaManager.scala:52)
              at kafka.server.KafkaApis.sendResponseMaybeThrottle(KafkaApis.scala:2033)
              at kafka.server.KafkaApis.kafka$server$KafkaApis$$fetchResponseCallback$1(KafkaApis.scala:569)
              at kafka.server.KafkaApis$$anonfun$kafka$server$KafkaApis$$processResponseCallback$1$1.apply$mcVI$sp(KafkaApis.scala:588)
              at kafka.server.ClientQuotaManager.maybeRecordAndThrottle(ClientQuotaManager.scala:175)
              at kafka.server.KafkaApis.kafka$server$KafkaApis$$processResponseCallback$1(KafkaApis.scala:587)
              at kafka.server.KafkaApis$$anonfun$handleFetchRequest$3.apply(KafkaApis.scala:604)
              at kafka.server.KafkaApis$$anonfun$handleFetchRequest$3.apply(KafkaApis.scala:604)
              at kafka.server.ReplicaManager.fetchMessages(ReplicaManager.scala:820)
              at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:596)
              at kafka.server.KafkaApis.handle(KafkaApis.scala:100)
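
      For context on what that allocation is doing: the log is stored in the 0.11 message format but older (0.9/0.10) consumers are fetching, so the broker has to down-convert, re-encoding the fetched records into the old format in a newly allocated heap ByteBuffer (the ByteBuffer.allocate call at the top of the trace). The sketch below only illustrates that pattern and is not the actual Kafka code; the Record class and the estimate/write helpers are hypothetical placeholders.

      import java.nio.ByteBuffer;
      import java.util.Arrays;
      import java.util.List;

      // Illustrative sketch only -- NOT the real AbstractRecords.downConvert.
      // It mirrors the allocation pattern in the stack trace: the entire
      // down-converted copy of a fetched partition is materialised in one
      // heap ByteBuffer before it is sent to the old-format consumer.
      public class DownConvertSketch {

          static final class Record {   // hypothetical stand-in for a fetched record
              final byte[] key;
              final byte[] value;
              Record(byte[] key, byte[] value) { this.key = key; this.value = value; }
          }

          // Rough upper bound on the re-encoded size: assumed per-message overhead plus payload.
          static int estimateOldFormatSize(List<Record> records) {
              int size = 0;
              for (Record r : records) {
                  size += 34;                                  // assumed fixed per-message overhead
                  size += (r.key == null ? 0 : r.key.length);
                  size += (r.value == null ? 0 : r.value.length);
              }
              return size;
          }

          // The converted copy lives on the heap until the response buffer is
          // released; if those buffers are retained instead, the broker OOMs.
          static ByteBuffer downConvert(List<Record> records) {
              ByteBuffer buffer = ByteBuffer.allocate(estimateOldFormatSize(records)); // the allocation that OOMs
              for (Record r : records)
                  writeAsOldFormat(buffer, r);
              buffer.flip();
              return buffer;
          }

          // Placeholder for re-encoding one record in the pre-0.11 message format.
          static void writeAsOldFormat(ByteBuffer buffer, Record r) {
              buffer.putInt(r.key == null ? -1 : r.key.length);
              if (r.key != null) buffer.put(r.key);
              buffer.putInt(r.value == null ? -1 : r.value.length);
              if (r.value != null) buffer.put(r.value);
          }

          public static void main(String[] args) {
              List<Record> batch = Arrays.asList(new Record("k".getBytes(), new byte[1000]));
              System.out.println("converted bytes: " + downConvert(batch).remaining());
          }
      }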
      

      And then after a few of those errors it settles into this kind of pattern:

      {"timestamp":"2017-11-06 15:06:48,114","level":"ERROR","logger":"kafka.server.KafkaApis","thread":"kafka-request-handler-1","message":"[KafkaApi-1] Error when handling request {replica_id=-1,max_wait_time=500,min_bytes=1,topics=[{topic=maxwell.accounts,partitions=[{partition=4,fetch_offset=560631,max_bytes=11000000},{partition=8,fetch_offset=557589,max_bytes=11000000},{partition=12,fetch_offset=551712,max_bytes=11000000}]}]}"}
      java.lang.OutOfMemoryError: Java heap space
      {"timestamp":"2017-11-06 15:06:48,811","level":"ERROR","logger":"kafka.server.KafkaApis","thread":"kafka-request-handler-7","message":"[KafkaApi-1] Error when handling request {replica_id=-1,max_wait_time=500,min_bytes=1,topics=[{topic=maxwell.accounts,partitions=[{partition=4,fetch_offset=560631,max_bytes=11000000},{partition=8,fetch_offset=557589,max_bytes=11000000},{partition=12,fetch_offset=551712,max_bytes=11000000}]}]}"}
      java.lang.OutOfMemoryError: Java heap space
      

      I've attached the heap use graphs. Heap use steadily increases to the maximum, at which point the errors start appearing.

      I've tripled the heap space for one of the 1.0.0 hosts to see what happens, and it similarly climbs to near 6 GB, then starts throwing java.lang.OutOfMemoryError again. I've attached those heap space graphs as well; the line that starts climbing from 2 GB is when the broker was restarted with a 6 GB heap. The out of memory errors started right at the peak of the flatline.

      Here's a snippet from the broker logs: https://gist.github.com/brettrann/4bb8041e884a299b7b0b12645a04492d

      I've redacted some group names because I'd need to check with the teams about making them public. Let me know what more is needed and I can gather it. This is a test cluster and the problem appears easy enough to reproduce. Happy to gather as much info as needed.

      Our config is:

      broker.id=2
      delete.topic.enable=true
      auto.create.topics.enable=false
      auto.leader.rebalance.enable=true
      inter.broker.protocol.version=0.11.0.1
      log.message.format.version=0.11.0.1
      group.max.session.timeout.ms=300000
      port=9092
      num.network.threads=3
      num.io.threads=8
      socket.send.buffer.bytes=102400
      socket.receive.buffer.bytes=102400
      socket.request.max.bytes=104857600
      replica.fetch.max.bytes=10485760
      log.dirs=/data/kafka/logs
      num.partitions=1
      num.recovery.threads.per.data.dir=1
      log.retention.hours=168
      offsets.retention.minutes=10080
      log.segment.bytes=1073741824
      log.retention.check.interval.ms=300000
      log.cleaner.enable=true
      zookeeper.connect=zoo1:2181,zoo2:2181,zoo3:2181/kafka
      zookeeper.connection.timeout.ms=6000
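
      For a sense of scale, here is a back-of-envelope calculation using only numbers that appear above: the max_bytes=11000000 per partition and 7 partitions from the first failing fetch request, and the ~2 GB heap the brokers started with (tripling it gave the 6 GB mentioned earlier). The real footprint depends on how much data each partition actually returns.

      // Back-of-envelope arithmetic only; figures taken from the request logged above.
      public class FetchSizeEstimate {
          public static void main(String[] args) {
              long maxBytesPerPartition = 11_000_000L;  // max_bytes in the failing fetch request
              int partitionsPerRequest = 7;             // partitions in the maxwell.users request
              long perRequestCeiling = maxBytesPerPartition * partitionsPerRequest;  // ~77 MB

              long heapBytes = 2L * 1024 * 1024 * 1024; // the ~2 GB heap before it was tripled
              System.out.printf("Up to %.0f MB of down-converted data per fetch request%n",
                      perRequestCeiling / 1e6);
              System.out.printf("Roughly %d unreleased responses of that size fill the heap%n",
                      heapBytes / perRequestCeiling);
          }
      }

      So if down-converted response buffers are being retained rather than released, a few dozen responses of that size would be enough to exhaust the original heap.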
      

      This was also reported at the end of https://issues.apache.org/jira/browse/KAFKA-6042, which is a broker lockup/FD issue, but a new ticket was requested.

      Attachments

        1. Kafka_Internals___Datadog.png (28 kB, Brett Rann)
        2. Kafka_Internals___Datadog.png (32 kB, Brett Rann)


            People

              Assignee: Rajini Sivaram (rsivaram)
              Reporter: Brett Rann (brettrann)
              Votes: 0
              Watchers: 11

              Dates

                Created:
                Updated:
                Resolved: