Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-706

broker appears to be encoding ProduceResponse, but never sending it

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 0.8.0
    • None
    • core
    • None
    • reproduced on both Mac OS and RH linux, via private node.js client

    Description

      By all appearances, I seem to be able to convince a broker to periodically encode, but never transmit, a ProduceResponse. Unfortunately my client is proprietary, but I will share it with Neha via LI channels. But I will describe what's going on in the hopes that there's another trivial way to reproduce it. (I did search through JIRA, and haven't found anything that looks like this.)

      I am running a single instance zookeeper and single broker. I have a client that generates configurable amounts of data, tracking what is produced (both sent and ACK'd), and what is consumed. I was noticing that when using high transfer rates via high frequency single messages, my unack'd queue appeared to be getting continuously larger. So, I outfitted my client to log more information about correlation ids at various stages, and modified the kafka ProducerRequest/ProducerResponse to log (de)serialization of the same. I then used tcpdump to intercept all communications between my client and the broker. Finally, I configured my client to generate 1 message per ~10ms, each payload being approximately 33 bytes; requestAckTimeout was set to 2000ms, and requestAcksRequired was set to 1. I used 10ms as I found that 5ms or less caused my unacked queue to build up due to system speed – it simply couldn't keep up. 10ms keeps the load high, but just manageable. YMMV with that param. All of this is done on a single host, over loopback. I ran it on both my airbook, and a well setup RH linux box, and found the same problem.

      At startup, my system logged "expired" requests - meaning reqs that were sent, but for which no ACK, positive or negative, was seen from the broker, within 1.25x the requestAckTimeout (ie, 2500ms). I would let it settle until the unacked queue was stable at or around 0.

      What I found is this: ACKs are normally generated within milliseconds. This was demonstrated by my logging added to the scala ProducerRe* classes, and they are normally seen quickly by my client. But when the actual error occurs, namely that a request is ignored, the ProducerResponse class does encode the correct correlationId; however, a response containing that ID is never sent over the network, as evidenced by my tcpdump traces. In my experience this would take anywhere from 3-15 seconds to occur after the system was warm, meaning that it's 1 out of several hundred on average that shows the condition.

      While I can't attach my client code, I could attach logs; but since my intention is to share the code with LI people, I will wait to see if that's useful here.

      Attachments

        Issue Links

          Activity

            People

              sriramsub Sriram
              benfleis ben fleis
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: