  Kafka / KAFKA-2096

Enable keepalive socket option for broker to prevent socket leak

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.8.2.1
    • Fix Version/s: 0.9.0.0
    • Component/s: network
    • Labels: None

    Description

      We run a Kafka 0.8.2.1 cluster in AWS with a large number of producers (> 10000). Also, the number of producer instances scales up and down significantly on a daily basis.

      The issue we found is that after 10 days, the open file descriptor count approaches the limit of 32K. An investigation of these open file descriptors shows that a significant portion of them belong to client instances that were terminated during scale-down, yet somehow they still show as "ESTABLISHED" in netstat. We suspect that the AWS firewall between the client and the broker causes this issue.

      We attempted to use the "keepalive" socket option to reduce this socket leak on the broker, and it appears to be working. Specifically, we added this line to kafka.network.Acceptor.accept():

      socketChannel.socket().setKeepAlive(true)

      Our experiment with this change confirmed that netstat entries for terminated client instances were probed as configured in the operating system. After the configured number of probes, the OS determined that the peer was no longer alive and the entry was removed, possibly after an error in Kafka reading from the channel and closing the channel. The experiment also shows that after a few days, the instance with this change was able to keep a stable low point of open file descriptor count, compared with other instances where the low point keeps increasing day by day.
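
      For illustration, here is a minimal, hypothetical sketch of a server-side accept loop with the change applied. It is not the actual kafka.network.Acceptor code; the object name, port, and surrounding structure are simplified placeholders, and only the setKeepAlive(true) line corresponds to the change described above.

      import java.net.InetSocketAddress
      import java.nio.channels.ServerSocketChannel

      object KeepAliveAcceptSketch {
        def main(args: Array[String]): Unit = {
          // Hypothetical stand-in for the broker's accept loop; port 9092 is a placeholder.
          val serverChannel = ServerSocketChannel.open()
          serverChannel.socket().bind(new InetSocketAddress(9092))
          while (true) {
            val socketChannel = serverChannel.accept()
            // Enable SO_KEEPALIVE so the OS probes idle peers; a peer that has
            // silently disappeared (e.g. a terminated instance behind a firewall)
            // eventually causes a read error, letting the broker close the channel.
            socketChannel.socket().setKeepAlive(true)
            socketChannel.configureBlocking(false)
            // ... hand the channel off to a processor thread here ...
          }
        }
      }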

      Attachments

        1. patch.diff
          0.7 kB
          Allen Wang

        Activity

          junrao Jun Rao added a comment -

          allenxwang, that seems to be a good fix. Do you want to submit a patch?

          allenxwang Allen Wang added a comment -

          junrao, yes, I would like to submit a patch.

          One thing to consider is whether we want to make this configurable. My understanding is that TCP keepalive should not affect the client. The only side effect is an increase in network traffic due to the probes. On the other hand, making it configurable is less intrusive.

          junrao Jun Rao added a comment -

          We probably don't need to make this configurable. The default Linux settings are the following and shouldn't add too much overhead. Plus, the clients already have keepAlive set.
          tcp_keepalive_time = 7200 (seconds)
          tcp_keepalive_intvl = 75 (seconds)
          tcp_keepalive_probes = 9 (number of probes)
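          With these defaults, an idle connection to a peer that has silently disappeared would be detected after at most tcp_keepalive_time + tcp_keepalive_probes * tcp_keepalive_intvl = 7200 + 9 * 75 = 7875 seconds, roughly 2.2 hours.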

          allenxwang Allen Wang added a comment -

          Patch for the fix

          allenxwang Allen Wang added a comment -

          To verify the fix, the socket connections on the broker shown by netstat -o should have "keepalive" at the end of the line:

          tcp6 0 0 xyz-:7101 ip-10-81-144-131.:48779 ESTABLISHED keepalive (7111.94/0/0)

          junrao Jun Rao added a comment -

          Thanks for the patch. +1 and committed to trunk.

          yazgoo yazgoo added a comment -

          This issue also seems to affect the 0.8.1 branch (the socket initialisation in the accept() method has not changed).
          Is it possible to mark it for 0.8.1.2 as well?
          I can submit a patch if need be.

          alex.m3tal Alex the Rocker added a comment - edited

          We also hit the same issue with Kafka 0.8.1.1; is it possible to have the fix in 0.8.1.2? Using conntrack-tools, we observe brokers with a huge number (up to 14000) of UNREPLIED sessions.

          The question is: do we need to open a new JIRA to report the same issue with 0.8.1.1?

          faisal.siddiqui Faisal added a comment -

          Does this solution also resolve the following error in Spark Streaming direct mode connecting to Kafka?
          Too many open files, java.net.SocketException
          After running for 5-10 days with a 10 second interval, my Spark Streaming job gets this error on the driver node; I only see it in the driver log file.
          Kafka version: 0.8.2.0
          Spark streaming: 1.5.0-cdh5.5.6


          People

            allenxwang Allen Wang
            allenxwang Allen Wang
            Jun Rao Jun Rao
            Votes: 0
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved: