Cassandra / CASSANDRA-4571

Strange permanent increase in socket descriptors leads to "Too many open files"

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Fix Version/s: 1.1.5
    • Component/s: None
    • Labels:
      None
    • Environment:

      Description

      On a two-node cluster we observed a strange, steady increase in socket descriptors. lsof -n | grep java shows many rows like:

      java 8380 cassandra 113r unix 0xffff8101a374a080 938348482 socket
      java 8380 cassandra 114r unix 0xffff8101a374a080 938348482 socket
      java 8380 cassandra 115r unix 0xffff8101a374a080 938348482 socket
      java 8380 cassandra 116r unix 0xffff8101a374a080 938348482 socket
      java 8380 cassandra 117r unix 0xffff8101a374a080 938348482 socket
      java 8380 cassandra 118r unix 0xffff8101a374a080 938348482 socket
      java 8380 cassandra 119r unix 0xffff8101a374a080 938348482 socket
      java 8380 cassandra 120r unix 0xffff8101a374a080 938348482 socket
      The number of these rows increases constantly. After about 24 hours this leads to the "Too many open files" error.
      We use the PHPCassa client. The load is not very high (around 50 kb/s on writes).

      1. 4571.txt
        13 kB
        Jonathan Ellis

        Activity

        Brandon Williams added a comment -

        I've seen this a few times, but never found a cause/resolution, so I'll go ahead and dump what I know:

        • All cases thus far seem to be upgrades, not new installations.
        • Versions from 1.1 up to (but not including) 1.1.2 don't seem to exhibit it.
        • Cassandra doesn't use unix sockets at all.
        • This is fairly rare and only hits a handful of users.
        • Some people have this happen on all nodes, some on only a portion.
        • Going to such lengths as trying all kinds of different JVM versions and completely switching OSes has not helped.

        One user wrote a simple app to track the lost FDs (http://pastebin.com/faBkJueB), and it seemed to correlate with opening one sstable; another user has corroborated that. Both report heavy reads on that CF.

        No way to reproduce is yet known; I've failed in all my attempts.
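
        For reference, a minimal sketch of such a tracker (hypothetical, not the pastebin app; assumes Linux, where /proc/PID/fd lists a process's descriptors, and that it runs as the same user as the Cassandra process or as root):

          import java.io.File;
          import java.io.IOException;
          import java.nio.file.Files;

          // Minimal FD-tracker sketch (hypothetical, not the pastebin app above).
          // Periodically counts a process's open descriptors and how many of them
          // are sockets by reading the /proc/PID/fd symlinks on Linux.
          public class FdTracker {
              public static void main(String[] args) throws Exception {
                  String pid = args.length > 0 ? args[0] : "self";
                  File fdDir = new File("/proc/" + pid + "/fd");
                  while (true) {
                      int total = 0, sockets = 0;
                      File[] fds = fdDir.listFiles();
                      if (fds != null) {
                          for (File fd : fds) {
                              total++;
                              try {
                                  // Each entry is a symlink such as "socket:[938348482]" or a file path.
                                  if (Files.readSymbolicLink(fd.toPath()).toString().startsWith("socket:")) {
                                      sockets++;
                                  }
                              } catch (IOException ignored) {
                                  // The FD may have been closed between listFiles() and readSymbolicLink().
                              }
                          }
                      }
                      System.out.println("open fds: " + total + ", sockets: " + sockets);
                      Thread.sleep(10000);
                  }
              }
          }

        Looking at the non-socket symlink targets (the sstable paths) over time can also help correlate the growth with a specific CF.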

        Tobias Grahn added a comment -

        I have seen the problem on Cassandra 1.1.3 as well. Our 3-node cluster has the same issue; it's a blocker for us.
        We are using Hector as the client, and the FD count increases up to 100K and keeps growing...

        Java (build 1.6.0_32-b05)
        Linux 2.6.32-220.el6.x86_64 #1 SMP Wed Nov 9 08:03:13 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

        After a fresh start Cassandra uses one unix-socket FD; once we put some load on, the count keeps growing.

        lsof -p 14597 | grep -i unix
        java 14597 root 43u unix 0xffff88082a3acc80 0t0 42443166 socket

        Put load on Cassandra and the count increases:

        lsof -p 14597 | grep -i unix | wc -l
        5678
        7654
        .....
        98403

        Tobias Grahn added a comment -

        I can reproduce it every time we run a specific test case that puts read-heavy load on the cluster.
        We have a new installation of Cassandra 1.1.3.

        So if you want a trace or dump of anything, I can provide it.

        Serg Shnerson added a comment -

        It seems that the bug is related to Java NIO internals (maybe to the Thrift framework). Please read https://forums.oracle.com/forums/thread.jspa?threadID=1146235 for more details and share your thoughts.
        From topic: "I am submitting this post to highlight a possible NIO "gotcha" in multithreaded applications and pose a couple of questions. We have observed file descriptor resource leakage (eventually leading to server failure) in a server process using NIO within the excellent framework written by Ronny Standtke (http://nioframework.sourceforge.net). Platform is JDK1.6.0_05 on RHEL4. I don't think that this is the same issue as that in connection with TCP CLOSED sockets reported elsewhere - What leaks here are descriptors connected to Unix domain sockets.

        In the framework, SelectableChannels registered in a selector are select()-ed in a single thread that handles data transfer to clients of the selector channels, executing in different threads. When a client shuts down its connection (invoking key.cancel() and key.channel.close()) eventually we get to JRE AbstractInterruptibleChannel::close() and SocketChannelImpl::implCloseSelectableChannel() which does the preClose() - via JNI this dup2()s a statically maintained descriptor (attached to a dummy Unix domain socket) onto the underlying file descriptor (as discussed by Alan Bateman (http://mail.openjdk.java.net/pipermail/core-libs-dev/2008-January/000219.html)). The problem occurs when the select() thread runs at the same time and the cancelled key is seen by SelectorImpl::processDeregisterQueue(). Eventually (in our case) EPollSelectorImpl::implDereg() tests the "channel closed" flag set by AbstractInterruptibleChannel::close() (this is not read-protected by a lock) and executes channel.kill() which closes the underlying file descriptor. If this happens before the preClose() in the other thread, the out-of-sequence dup2() leaks the file descriptor, attached to the UNIX domain socket.

        In the framework mentioned, we don't particularly want to add locking in the select() thread as this would impact other clients of the selector - alternatively a fix is to simply comment out the key.cancel(). channel.close() does the cancel() for us anyway, but after the close()/preClose() has completed, so the select() processing then occurs in the right sequence. (I am notifying Ronny Standtke of this issue independently)."

        See also the following links for more information:
        http://stackoverflow.com/questions/7038688/java-nio-causes-file-descriptor-leak
        http://mail-archives.apache.org/mod_mbox/tomcat-users/201201.mbox/%3CCAJkSUv-DDKTCQ-pD7W=QOVmPH1dXeXOvcr+3mCgu05cqpT7Zjg@mail.gmail.com%3E
        http://www.apacheserver.net/HBase-Thrift-for-CDH3U3-leaking-file-descriptors-socket-at1580921.htm
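
        To make the ordering concrete, here is a minimal sketch of the two shutdown variants described in the quoted post (hypothetical code; not from Cassandra, Thrift, or the framework mentioned): the racy variant cancels the key explicitly before closing, while the safer variant relies on close() alone, which runs preClose() first and then cancels the key itself.

          import java.io.IOException;
          import java.nio.channels.SelectionKey;

          // Sketch of the two shutdown orderings discussed in the quoted post
          // (hypothetical, not Cassandra or Thrift code).
          final class ConnectionShutdown {

              // Racy: an explicit cancel() makes the key visible to the selector's
              // deregister queue before close() has run its preClose()/dup2 step,
              // so a concurrent select() can kill the channel first and the later
              // dup2 leaks a descriptor attached to the dummy unix domain socket.
              static void closeWithExplicitCancel(SelectionKey key) throws IOException {
                  key.cancel();
                  key.channel().close();
              }

              // Safer: close() performs preClose() first and cancels the key itself,
              // so the selector thread only deregisters after the dup2 has happened.
              static void closeOnly(SelectionKey key) throws IOException {
                  key.channel().close();
              }
          }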

        Jeremy Hanna added a comment -

        Tobias: is it possible to get the test case and the server setup to try to reproduce? Heap dumps haven't proven very useful thus far.

        Serg Shnerson added a comment -

        The bug does not reproduce on a one-node cluster.

        Per Otterström added a comment -

        Adding a few more observations from the 3-node cluster mentioned by Tobias above...

        We have tried to reproduce the problem in a more isolated scenario without success.

        I don't know whether this is of any use, but we used strace to get a rough view of what's causing the leaked file descriptors. During the strace session we placed a constant read load on the cluster, and strace collected ~2,400,000 rows of system calls. The first 2,000,000 have the following pattern:

        cat strace.cassandra.1* | sort | head -2000000 | grep open | wc -l
          47
        cat strace.cassandra.1* | sort | head -2000000 | grep close | wc -l
          48
        cat strace.cassandra.1* | sort | head -2000000 | grep dup2 | wc -l
          20
        cat strace.cassandra.1* | sort | head -2000000 | grep fcntl | wc -l
          86

        The last 400,000 collected system calls reflect the time when the FD leakage started:

        cat strace.cassandra.1* | sort | tail -390000 | grep open | wc -l
          11344
        cat strace.cassandra.1* | sort | tail -390000 | grep close | wc -l
          5718
        cat strace.cassandra.1* | sort | tail -390000 | grep dup2 | wc -l
          16280
        cat strace.cassandra.1* | sort | tail -390000 | grep fcntl | wc -l
          22670

        As you can see, only a small fraction of the calls to open() and dup2() have a corresponding call to close().

        A glimpse of the system calls at the time the problem starts:

        cat strace.cassandra.1* | sort -s -n -k 1 | less
          14:59:38 open("/ssd_disks_1/cassandra/data/cake/vouchers/cake-vouchers-he-1260-Data.db", O_RDONLY) = 2381
          14:59:38 fstat(2381, {st_mode=S_IFREG|0644, st_size=448850532, ...}) = 0
          14:59:38 fcntl(2381, F_GETFD) = 0
          14:59:38 fcntl(2381, F_SETFD, FD_CLOEXEC) = 0
          14:59:38 fstat(2381, {st_mode=S_IFREG|0644, st_size=448850532, ...}) = 0
          14:59:38 lseek(2381, 0, SEEK_CUR) = 0
          14:59:38 lseek(2381, 53072226, SEEK_SET) = 53072226
          14:59:38 read(2381, "\200\200\4tPREPARE\1\0\4\307\352\fi|J\0\0\0\4P5\337\257\0\3XI"..., 15790) = 15790
          14:59:38 lseek(2381, 0, SEEK_CUR) = 53088016
          14:59:38 read(2381, "\330`\243\311", 4) = 4
          14:59:38 open("/ssd_disks_1/cassandra/data/cake/vouchers/cake-vouchers-he-1275-Data.db", O_RDONLY) = 2387
          14:59:38 fstat(2387, {st_mode=S_IFREG|0644, st_size=9056852, ...}) = 0
          14:59:38 fcntl(2387, F_GETFD) = 0
          14:59:38 fcntl(2387, F_SETFD, FD_CLOEXEC) = 0
          14:59:38 fstat(2387, {st_mode=S_IFREG|0644, st_size=9056852, ...}) = 0
          14:59:38 lseek(2387, 0, SEEK_CUR) = 0
          14:59:38 lseek(2387, 1073276, SEEK_SET) = 1073276
          14:59:38 read(2387, "\200\200\4\360>nd\0\7batchId\0\0\4\307\3528\3760\240\0\0\0\34VOU"..., 18284) = 18284
          14:59:38 lseek(2387, 0, SEEK_CUR) = 1091560
          14:59:38 read(2387, "\301\320\321r", 4) = 4
          14:59:38 dup2(43, 2387) = 2387
          14:59:38 close(2387) = 0
          14:59:38 futex(0x7fa968000eb4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fa968000eb0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
          14:59:38 open("/ssd_disks_1/cassandra/data/cake/vouchers/cake-vouchers-he-1260-Data.db", O_RDONLY) = 2389
          14:59:38 fstat(2389, {st_mode=S_IFREG|0644, st_size=448850532, ...}) = 0
          14:59:38 fcntl(2389, F_GETFD) = 0
          14:59:38 fcntl(2389, F_SETFD, FD_CLOEXEC) = 0
          14:59:38 fstat(2389, {st_mode=S_IFREG|0644, st_size=448850532, ...}) = 0
          14:59:38 lseek(2389, 0, SEEK_CUR) = 0
          14:59:38 lseek(2389, 16713707, SEEK_SET) = 16713707
          14:59:38 read(2389, "\200\200\4tgent\0\0\4\307\352\3\25\335\323\0\0\0\4bond\0\7batch"..., 15853) = 15853
          14:59:38 lseek(2389, 0, SEEK_CUR) = 16729560
          14:59:38 read(2389, "pS\r\235", 4) = 4
          14:59:38 open("/ssd_disks_1/cassandra/data/cake/vouchers/cake-vouchers-he-1275-Data.db", O_RDONLY) = 2401
          14:59:38 fstat(2401, {st_mode=S_IFREG|0644, st_size=9056852, ...}) = 0
          14:59:38 fcntl(2401, F_GETFD) = 0
          14:59:38 fcntl(2401, F_SETFD, FD_CLOEXEC) = 0
          14:59:38 fstat(2401, {st_mode=S_IFREG|0644, st_size=9056852, ...}) = 0
          14:59:38 lseek(2401, 0, SEEK_CUR) = 0
          14:59:38 lseek(2401, 325664, SEEK_SET) = 325664
          14:59:38 read(2401, "\200\200\4\01066\0\t\0010(\0\0\0\20\0\0\0\3\0\0\0\1\t\23\t\1(P5\353\317"..., 18359) = 18359
          14:59:38 lseek(2401, 0, SEEK_CUR) = 344023
          14:59:38 read(2401, "\265.\321p", 4) = 4
          14:59:38 dup2(43, 2401) = 2401
          14:59:38 close(2401) = 0
          14:59:38 futex(0x41e1f804, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x41e1f800, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
          14:59:38 futex(0x7fae0c3278d0, FUTEX_WAKE_PRIVATE, 1) = 0
          14:59:38 open("/ssd_disks_1/cassandra/data/cake/vouchers/cake-vouchers-he-1260-Data.db", O_RDONLY) = 2416
          14:59:38 fstat(2416, {st_mode=S_IFREG|0644, st_size=448850532, ...}) = 0
          14:59:38 fcntl(2416, F_GETFD) = 0
          14:59:38 fcntl(2416, F_SETFD, FD_CLOEXEC) = 0
          14:59:38 fstat(2416, {st_mode=S_IFREG|0644, st_size=448850532, ...}) = 0
          14:59:38 lseek(2416, 0, SEEK_CUR) = 0
          14:59:38 lseek(2416, 275573213, SEEK_SET) = 275573213
          14:59:38 read(2416, "\200\200\4\0000n\1\0\310\0\16activationCode\0\0\4\307\351\361#"..., 15829) = 15829
          14:59:38 lseek(2416, 0, SEEK_CUR) = 275589042
          14:59:38 read(2416, "\316\223j\235", 4) = 4
          14:59:38 open("/ssd_disks_1/cassandra/data/cake/vouchers/cake-vouchers-he-1275-Data.db", O_RDONLY) = 2438
          14:59:38 fstat(2438, {st_mode=S_IFREG|0644, st_size=9056852, ...}) = 0
          14:59:38 fcntl(2438, F_GETFD) = 0
          14:59:38 fcntl(2438, F_SETFD, FD_CLOEXEC) = 0
          14:59:38 fstat(2438, {st_mode=S_IFREG|0644, st_size=9056852, ...}) = 0
          14:59:38 lseek(2438, 0, SEEK_CUR) = 0
          14:59:38 lseek(2438, 5538661, SEEK_SET) = 5538661
          14:59:38 read(2438, "\200\200\4\0\0\r\1,P5\354\"\0\4\307\352\312^\3450\5\23<\01755643264"..., 18465) = 18465
          14:59:38 lseek(2438, 0, SEEK_CUR) = 5557126
          14:59:38 read(2438, "\232\327\216q", 4) = 4
          14:59:38 dup2(43, 2438) = 2438
          14:59:38 close(2438) = 0

        Jonathan Ellis added a comment -

        https://forums.oracle.com/forums/thread.jspa?threadID=1146235

        That thread says, "the race condition with cancel is something we are currently working on in http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6693490," which is marked fixed in java6u18.

        Jonathan Ellis added a comment -

        Tobias/Per, can you try your test case against 1.1.2 to verify that this was indeed introduced in 1.1.3 somehow?

        joe miller added a comment -

        We are seeing this fd socket leak issue on 1.1.2 (after upgrade from 1.0.10)

        Jonathan Ellis added a comment -

        Thanks Joe. Can anyone confirm/reject 1.1.1 then?

        Tim Meighen added a comment -

        Just wanted to add my experience with this. In our case we saw the issue when we upgraded from 1.0.x to 1.1.x. We are currently running 1.1.2 on Ubuntu 12.04 with Sun JDK 1.6.0_34. I'm pretty sure that we had the issue with 1.1.1 as well.

        With the help of some kind folks on the IRC channel I was able to track the vast majority of the leaks to one particular column family. Some characteristics of the CF:

        • Heavy read load
        • Data is deleted
        • Accessed using Hector Object Mapping (the only CF in our system that uses this)
        • Relatively small Column Family
        • Row caching enabled *

        Once I was able to pinpoint this table I started reviewing the schema and noticed that in the migration from 1.0.x to 1.1.x we had lost our row cache settings. The original schema defined row_cache_size=20000, but after the upgrade I found that caching was set to "KEYS_ONLY" instead of "ALL". When I re-enabled the row cache we stopped leaking file descriptors. I'm assuming this is because the data was being pulled from memory instead of disk, so the underlying issue was no longer exposed.

        Another thing to note is that our ring (4 nodes at the time, I've upped it to 6 since) was under fairly heavy load and doing a lot of garbage collection.

        Serg Shnerson added a comment -

        I have checked 1.1.1 and have found the same issue.

        Steven Willcox added a comment - edited

        We are also seeing this bug and all nodes eventually run out of file descriptors and crash. It is a blocker for us.

        Per Otterström added a comment -

        To verify, we started from scratch with a new installation on 3 servers, and the FD leak is still there. So with our particular setup we are able to reproduce the bug.

        These are the characteristics of our setup:

        • We have one single CF.
        • Rows are inserted in batches.
        • Rows are read, updated, and deleted in a random-like pattern.
        • The FD leak seems to start during heavy read load (but can appear during mixed read/write/delete operations as well).
        • We are using Hector to access this single CF.
        • Cassandra configuration is basically standard.

        The FD leak does not show up immediately; it appears once there are ~60M rows in the CF.

        Jonathan Ellis added a comment -

        Are you sure you can't reproduce on a single-node cluster?

        Because we're getting conflicting evidence here; on the one hand, strace indicates that the fd leakage is related to file i/o, but if so, you shouldn't need multiple nodes in the cluster to repro.

        Serg Shnerson added a comment - edited

        Are you sure you can't reproduce on a single-node cluster?

        My mistake. I've checked it again, and the bug is also reproduced with a one-node cluster.

        Jonathan Ellis added a comment -

        The FD leak seems to start during heavy read load (but can appear during mixed read/write/delete operations as well)

        Does this mean that you can reproduce the leak if you stop doing inserts/updates entirely and just do reads?

        What kind of reads are you doing? index lookups? seq scan? named-columns-from-row? slice-from-row?

        Tobias Grahn added a comment -

        Yes. I have run traffic with mixed operations for a couple of days and Cassandra is still running.
        However, I just started a traffic test case that does only reads, and the issue came back immediately.
        I don't know if it matters, but all my reads return empty responses, which is expected since these keys should be deleted.
        The test case does a lookup on the key only.

        E.g. I have a file with a number of keys; the test case tries to look up a row using each key, and the response is empty since these keys do not exist.

        Steven Willcox added a comment - edited

        We can reliably reproduce this issue in our test environment every day. Start the servers up in the morning, and by the end of the workday the number of open file descriptors reaches 40-60K and the nodes stop responding. We have turned row caching off and it still has this problem. You can contact me if you think remotely debugging this issue will help in determining what is causing this.

        We have reproduced this using just one node.

        Brandon Williams added a comment -

        Thanks to a reproducible example from Viktor Kuzmin, I've bisected this down to CASSANDRA-4116.

        Jonathan Ellis added a comment -

        Patch to fix 4116 sstable iterator leak.

        Brandon Williams added a comment -

        Issue does not repro with this patch.

        Sam Tunnicliffe added a comment -

        LGTM +1

        Jonathan Ellis added a comment -

        Committed! Thanks everyone for the help tracking this down.

        Joaquin Casares added a comment -

        This can still be seen in 1.1.5 if the user is running Java 1.6.0_29. The current solution is to upgrade to 1.6.0_35.

        Chris Herron added a comment -

        For anybody else encountering this unbounded socket growth problem on 1.1.5, note that while upgrading to 1.6.0_35 seemed to help, a longer load test still reproduced the symptom. FWIW, upgradesstables ran for a period during this particular test; it is unclear whether the increased compaction activity contributed.

        Jonathan Ellis added a comment -

        Related to CASSANDRA-4740?

        Chris Herron added a comment -

        FYI, I was able to reproduce the symptom on Cassandra 1.1.6.
        @Jonathan Ellis Re: CASSANDRA-4740 and whether it relates to this:

        • Haven't looked across all nodes for phantom connections yet
        • Have searched across all logs - found a single instance of "Timed out replaying hints".
        • Mina mentioned that "Nodes running earlier kernels (2.6.39, 3.0, 3.1) haven't exhibited this". We are seeing this on Linux kernel 2.6.35 with Java 1.6.0_35.

        Chris Herron added a comment -

        We are also seeing errors similar to those reported in CASSANDRA-4687.
        Could this be a side-effect of that problem? In SSTableSliceIterator as of commit e1b10590e84189b92af168e33a63c14c3ca1f5fa, if the constructor key equality assertion fails, fileToClose does not get closed.
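
        For illustration, a minimal sketch (hypothetical; not the actual Cassandra source, and LeakSafeIterator is a made-up name) of the close-on-constructor-failure pattern that observation points at: if validation throws after the file has been opened, the handle has to be closed before the exception propagates, otherwise the descriptor leaks.

          import java.io.Closeable;
          import java.io.IOException;

          // Illustrative sketch only; the class and constructor are hypothetical.
          final class LeakSafeIterator {
              private final Closeable fileToClose;

              LeakSafeIterator(Closeable fileToClose, Object expectedKey, Object actualKey) {
                  this.fileToClose = fileToClose;
                  try {
                      // Stand-in for the key equality assertion in the real iterator.
                      if (!expectedKey.equals(actualKey)) {
                          throw new AssertionError("keys do not match: " + expectedKey + " vs " + actualKey);
                      }
                  } catch (RuntimeException | Error e) {
                      // Release the descriptor before letting the failure propagate.
                      try { fileToClose.close(); } catch (IOException ignored) { }
                      throw e;
                  }
              }
          }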

        Jonathan Ellis added a comment -

        Are you then seeing that assertion failure logged?

        Chris Herron added a comment -

        Yes, seeing the key equality AssertionErrors from two SSTable iterators: SSTableSliceIterator:60 and SSTableNamesIterator:72.
        Also seeing the same EOF error reported by T Jake Luciani in CASSANDRA-4687:

        java.io.IOError: java.io.EOFException: unable to seek to position 61291844 in /redacted/cassandra/data/test1/redacted/test1-redacted-hf-1-Data.db (59874704 bytes) in read-only mode
                at org.apache.cassandra.io.util.CompressedSegmentedFile.getSegment(CompressedSegmentedFile.java:69)
                at org.apache.cassandra.io.sstable.SSTableReader.getFileDataInput(SSTableReader.java:898)
                at org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:50)
                at org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:67)
                at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:79)
                at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:256)
                at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:64)
                at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1345)
                at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1207)
                at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1142)
                at org.apache.cassandra.db.Table.getRow(Table.java:378)
                at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:69)
                at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:51)
                at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
                at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                at java.lang.Thread.run(Thread.java:662)
        Caused by: java.io.EOFException: unable to seek to position 61291844 in /redacted/cassandra/data/test1/redacted/test1-redacted-hf-1-Data.db (59874704 bytes) in read-only mode
                at org.apache.cassandra.io.util.RandomAccessReader.seek(RandomAccessReader.java:253)
                at org.apache.cassandra.io.util.CompressedSegmentedFile.getSegment(CompressedSegmentedFile.java:64)
                ... 16 more
        
        Chris Herron added a comment -

        Tested this patch: https://gist.github.com/2f10efd3922fab9a095e applied to a build from branch cassandra-1.1 at commit 4d2e5e73b127dc0b335176ddc1dec1f0244e7f6d.

        This definitely reduced the growth of socket FD handles, but there must be other scenarios like this in the codebase, because the count did grow beyond 2, which is where I've seen it at steady state under normal conditions.

        The AssertionErrors from CASSANDRA-4687 were so spurious that they were pegging disk IO. When I ran the same test again with assertions disabled for the org.apache.cassandra.db.columniterator package, I saw many errors like those described in CASSANDRA-4417 ("invalid counter shard detected"). See my comments in that issue.

        Shouldn't CASSANDRA-4571 be re-opened?

        Jonathan Ellis added a comment -

        If it's caused by 4687 assertion errors? No.

        Disabling key cache is a workaround for 4687 btw.


          People

          • Assignee: Jonathan Ellis
          • Reporter: Serg Shnerson
          • Reviewer: Sam Tunnicliffe
          • Votes: 0
          • Watchers: 12
