KAFKA-3135

Unexpected delay before fetch response transmission

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1, 0.10.1.0, 0.10.1.1
    • Fix Version/s: 1.1.0
    • Component/s: None
    • Labels: None

      Description

      From the user list, Krzysztof Ciesielski reports the following:

      Scenario description:
      First, a producer writes 500000 elements into a topic.
      Then, a consumer starts to read, polling in a loop.
      When "max.partition.fetch.bytes" is set to a relatively small value, each "consumer.poll()" returns a batch of messages. If this value is left at its default, the output tends to look like this:

      Poll returned 13793 elements
      Poll returned 13793 elements
      Poll returned 13793 elements
      Poll returned 13793 elements
      Poll returned 0 elements
      Poll returned 0 elements
      Poll returned 0 elements
      Poll returned 0 elements
      Poll returned 13793 elements
      Poll returned 13793 elements

      As we can see, there are weird "gaps" when poll returns 0 elements for some
      time. What is the reason for that? Maybe there are some good practices
      about setting "max.partition.fetch.bytes" which I don't follow?

      The gist to reproduce this problem is here: https://gist.github.com/kciesielski/054bb4359a318aa17561.
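      A minimal sketch (not the exact code from the gist) of the consumer side of this scenario is shown below; the topic name "test-topic", group id, string deserializers, and the commented-out "max.partition.fetch.bytes" value are illustrative placeholders:

      import java.util.Collections;
      import java.util.Properties;
      import org.apache.kafka.clients.consumer.ConsumerRecords;
      import org.apache.kafka.clients.consumer.KafkaConsumer;

      public class PollLoopRepro {
          public static void main(String[] args) {
              Properties props = new Properties();
              props.put("bootstrap.servers", "localhost:9092");
              props.put("group.id", "poll-loop-repro");
              props.put("auto.offset.reset", "earliest");
              // String deserializers are an assumption; use whatever matches the producer.
              props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
              props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
              // props.put("max.partition.fetch.bytes", 65536);  // smaller value -> steady batches, per the report

              try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                  consumer.subscribe(Collections.singletonList("test-topic"));
                  while (true) {
                      // poll(long) is the 0.9/0.10-era API used by the affected versions
                      ConsumerRecords<String, String> records = consumer.poll(1000);
                      System.out.println("Poll returned " + records.count() + " elements");
                  }
              }
          }
      }

      With a small "max.partition.fetch.bytes" each poll returns a steady batch; with the default, the output shows the 0-element gaps quoted above.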

      After some initial investigation, the delay appears to be in the server's networking layer. Basically I see a delay of 5 seconds from the time that Selector.send() is invoked in SocketServer.Processor with the fetch response to the time that the send is completed. Using netstat in the middle of the delay shows the following output:

      tcp4       0      0  10.191.0.30.55455      10.191.0.30.9092       ESTABLISHED
      tcp4       0 102400  10.191.0.30.9092       10.191.0.30.55454      ESTABLISHED
      

      From this, it looks like the data reaches the send buffer, but needs to be flushed.

          Activity

          hachikuji Jason Gustafson added a comment -

          Interestingly, I can reproduce this pause even using the old consumer on 0.8.2 if I change the receive buffer size to 32K (as is the default for the new consumer). Similarly, by changing the new consumer's receive buffer to 64K (as is the default for the old consumer), the problem seems to go away. This may be a viable workaround until the root cause is known. Note that I have only tested this on my MacBook (OS X 10.10.5) with Oracle's JDK (1.8.0_65-b17).
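          A sketch of that workaround, applied to the consumer properties in the sketch under the description above: "receive.buffer.bytes" is the new consumer's setting for the socket receive buffer, and the 32K/64K figures are the 0.9-era defaults mentioned in this comment, so verify them for your client version.

          // Raise the socket receive buffer from the new consumer's 32K default
          // to the old consumer's 64K default, as described in the comment above.
          props.put("receive.buffer.bytes", 64 * 1024);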

          ijuma Ismael Juma added a comment -

          Jason Gustafson, do you know why the receive buffer size was changed to 32k in the new consumer?

          rangadi Raghu Angadi added a comment -
          tcp4       0      0  10.191.0.30.55455      10.191.0.30.9092       ESTABLISHED
          tcp4       0 102400  10.191.0.30.9092       10.191.0.30.55454      ESTABLISHED
          

          This shows the server side is OK: it has already flushed/written the data on its end of the TCP connection and is trying to send. If you check netstat on the other side of the connection (local port 55454), you should see data in its receive buffer. It's still not clear why the consumer is not reading.

          I am also seeing similar weird behavior. A 64K receive buffer reduces the intensity of the problem, but does not fix it.

          hachikuji Jason Gustafson added a comment -

          Raghu Angadi, are you on OS X? So far, I've only seen this reproduced on Macs.

          kciesielski Krzysiek Ciesielski added a comment -

          Jason Gustafson, actually this error was originally spotted on Linux. To be precise, a colleague of mine was preparing a Kafka evaluation for his customer on EC2 with the AWS Linux AMI (64-bit). The Kafka cluster was configured with Docker using "java8" as the base image. After noticing strange lags, we were able to reproduce this on OS X with the code from the gist linked in the description.

          rangadi Raghu Angadi added a comment -

          Jason Gustafson, mine is also on Linux with Java 8 (Docker containers on GCE). In one of the tests with the default receive buffers, select() alternated between waiting 0 ms and 20-30 ms, reading a few KB each time.

          hachikuji Jason Gustafson added a comment -

          Raghu Angadi, Krzysiek Ciesielski: thanks for letting me know. I previously tried to reproduce this on Linux, but wasn't able to. I'm going to bump this to critical and see if I can isolate the problem.

          hachikuji Jason Gustafson added a comment -

          I made a little progress understanding this problem. The 5 second pause that we're seeing is caused by the TCP persist timer, which kicks in when the remote window size gets small enough. You can see this by watching the traffic through tcpdump and confirming the persist timeout event with 'netstat -s'. Clearly there are some cases where the client is not reading the data off the socket fast enough, which is causing the window size to fall and the server to back off.
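          To make that mechanism concrete, here is a small standalone sketch using plain Java sockets (not Kafka code): a writer whose write() stalls because the peer drains the socket slowly through a small receive buffer. Whether the 5-second persist-timer pause itself shows up depends on the OS TCP stack, so this only illustrates the slow-reader condition described above:

          import java.io.InputStream;
          import java.io.OutputStream;
          import java.net.InetSocketAddress;
          import java.net.ServerSocket;
          import java.net.Socket;

          public class SlowReaderDemo {
              public static void main(String[] args) throws Exception {
                  ServerSocket server = new ServerSocket(0);
                  int port = server.getLocalPort();

                  // Writer thread: stands in for the broker pushing a large fetch response.
                  Thread writer = new Thread(() -> {
                      try (Socket peer = server.accept()) {
                          OutputStream out = peer.getOutputStream();
                          byte[] chunk = new byte[16 * 1024];
                          for (int i = 0; i < 256; i++) {
                              long start = System.currentTimeMillis();
                              out.write(chunk);  // blocks once the local send and remote receive buffers fill
                              long elapsed = System.currentTimeMillis() - start;
                              if (elapsed > 100) {
                                  System.out.printf("write %d stalled for %d ms%n", i, elapsed);
                              }
                          }
                      } catch (Exception e) {
                          e.printStackTrace();
                      }
                  });
                  writer.start();

                  // Reader: small receive buffer (like the new consumer's 32K default) and slow reads.
                  Socket client = new Socket();
                  client.setReceiveBufferSize(32 * 1024);  // must be set before connect()
                  client.connect(new InetSocketAddress("localhost", port));
                  InputStream in = client.getInputStream();
                  byte[] buf = new byte[4 * 1024];
                  while (in.read(buf) != -1) {
                      Thread.sleep(20);  // simulate a consumer that drains the socket slowly
                  }
                  client.close();
                  server.close();
              }
          }

          Running it while watching the traffic with tcpdump and the counters from 'netstat -s', as suggested above, is one way to see whether the write stalls line up with persist-timer activity.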

          jkreps Jay Kreps added a comment -

          This seems like an important one to get to the bottom of...

          jeffwidman Jeff Widman added a comment -

          Can the tags on this issue be updated to note that it applies to 0.10.x versions as well?

          At least, that's what it seems from reading through the issue description.

          ewencp Ewen Cheslack-Postava added a comment -

          Jeff Widman I've updated the affected versions, and I'll track this for the 0.10.2.0 release. Do we have an idea of how critical this issue is? It seems like it should be hit fairly commonly (based on ideas that it's due to TCP persist timer issues), but we've seen few comments since Mar/Apr of earlier this year. Is this still a critical issue or should we downgrade it to something that would be a good improvement and make it a major/minor/nice-to-have?

          ewencp Ewen Cheslack-Postava added a comment -

          Also, /cc Jason Gustafson since he was the one that made the most progress on this.

          jeffwidman Jeff Widman added a comment -

          It's not currently a critical issue for my company. Typically when we're considering upgrading we look at outstanding bugs to evaluate whether to upgrade or wait, so I just wanted the tags to be corrected. Thanks Ewen Cheslack-Postava for handling.


            People

            • Assignee: hachikuji Jason Gustafson
            • Reporter: hachikuji Jason Gustafson
            • Votes: 8
            • Watchers: 23
