CASSANDRA-3585

Intermittent exceptions seen in Cassandra 1.0.5 during reads.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Fix Version/s: 0.8.9, 1.0.6
    • Component/s: Core
    • Labels:
      None
    • Environment:

      RHEL 2.6.32-71.el6.x86_64.
      RAM - 28GB
      11 CPUs of 2.6GHz

      Description

      In my test setup, I have a Cassandra database provisioned with Cassandra 0.8.7, deployed across two data centers. I have upgraded Cassandra to the latest version, 1.0.5. I'm seeing the following exceptions in the Cassandra logs:

      ERROR [RequestResponseStage:32] 2011-12-06 14:46:08,150 AbstractCassandraDaemon.java (line 133) Fatal exception in thread Thread[RequestResponseStage:32,5,main]
      java.io.IOError: java.io.EOFException
      at org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:71)
      at org.apache.cassandra.service.ReadCallback.response(ReadCallback.java:126)
      at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:45)
      at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      at java.lang.Thread.run(Thread.java:662)
      Caused by: java.io.EOFException
      at java.io.DataInputStream.readFully(DataInputStream.java:180)
      at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:100)
      at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:81)
      at org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:64)
      ... 6 more

      RF is set to DC1:3,DC2:3 and I'm doing the operations with CL=LOCAL_QUORUM.
      I have run nodetool scrub on all the nodes in the ring to see if it would solve the issue, but it didn't.

      Thanks,
      Shantanu

      1. 3585.txt
        2 kB
        Jonathan Ellis
      2. 3585-v2.txt
        4 kB
        Jonathan Ellis
      3. CassandraLogs.tar.bz2
        9.29 MB
        Shantanu
      4. metap_system.log.zip
        2.71 MB
        Shantanu
      5. metap_system.log.zip
        3.87 MB
        Shantanu

        Activity

        Jonathan Ellis added a comment -

        Could you try with the attached patch? I've added more debugging information to the digest read path.

        Jonathan Ellis added a comment -

        The only change I can see here from 0.8 is in the treatment of counters. Are counters involved in the exception you are seeing?

        Shantanu added a comment -

        Here is the stack trace with this patch applied:

        ERROR [RequestResponseStage:25] 2011-12-08 11:14:56,289 AbstractCassandraDaemon.java (line 133) Fatal exception in thread Thread[RequestResponseStage:25,5,main]
        java.lang.AssertionError: 1067877
        at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:99)
        at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:81)
        at org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:64)
        at org.apache.cassandra.service.ReadCallback.response(ReadCallback.java:126)
        at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:45)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

        Shantanu added a comment -

        We are not using counters.

        Shantanu added a comment -

        Please find the system logs for one cassandra node with debug information.

        Jonathan Ellis added a comment -

        Interesting. The exception seems to happen in clusters, and it's always the same incorrect digest length, 1067877.

        Jonathan Ellis added a comment -

        The first cluster starts at 11:12:26 and finishes a few ms later. The next runs from 11:31:14 to 11:34:12. It's almost like what I'd expect to see from version negotiation, but it's a little too slow to finish.

        Jonathan Ellis added a comment -

        Another point: this node doesn't serve as request coordinator until much LATER than the exceptions. So why is there a ReadCallback in the stacktrace?

        Jonathan Ellis added a comment -

        Sorry, best I can do for now is give you another debug patch (replacing the first). If you turn logging up to TRACE, that would also be useful.

        Shantanu added a comment -

        Logs with TRACE enabled and the new patch.

        Shantanu added a comment -

        Hi Jonathan,

        As of now, the issue is only observed with a multi-DC deployment of Cassandra 1.0.5. The issue can also be seen with a clean database created with 1.0.5.

        Thanks,
        Shantanu

        Sylvain Lebresne added a comment -

        Are you using internode encryption?
        Also, if you could provide the logs for all the nodes involved, that could help (with Jonathan's patch applied). And of course, if you have an easy way to reproduce, that would be even better.

        Jonathan Ellis added a comment -

        In particular, logs from 192.168.136.234 would be useful since that is what sends the first bad response.

        Jonathan Ellis added a comment -

        Some interesting things from the trace log:

        The request that errors out:

        DEBUG [pool-2-thread-39] 2011-12-08 21:07:21,815 StorageProxy.java (line 627) Command/ConsistencyLevel is SliceByNamesReadCommand(table='KeyspaceMetadata', key=313030303237383033, columnParent='QueryPath(columnFamilyName='CF_MailboxSummary', superColumnName='null', columnName='null')', columns=[666f6c6465725f73756d6d6172795f4c057f31b9ba31138e6c1ff57f552f66,666f6c6465725f73756d6d6172795f5112276526f2330392edfc4b47cf4725,666f6c6465725f73756d6d6172795ff75e02920bc73235abb48ead4be65444,])/QUORUM
        

        The data read, and the digest read that errors out:

        DEBUG [pool-2-thread-39] 2011-12-08 21:07:21,836 StorageProxy.java (line 648) reading data from /192.168.136.231
        TRACE [pool-2-thread-39] 2011-12-08 21:07:21,836 MessagingService.java (line 412) /192.168.128.229 sending READ version 3 to 3242@/192.168.136.231
        DEBUG [pool-2-thread-39] 2011-12-08 21:07:21,869 StorageProxy.java (line 668) reading digest from /192.168.128.226
        TRACE [pool-2-thread-39] 2011-12-08 21:07:21,870 MessagingService.java (line 412) /192.168.128.229 sending READ version 3 to 3280@/192.168.128.226
        

        Failing to parse the reply:

        DEBUG [Thread-8] 2011-12-08 21:07:21,877 IncomingTcpConnection.java (line 149) Message body is 47 bytes
        TRACE [Thread-8] 2011-12-08 21:07:21,885 MessagingService.java (line 502) /192.168.128.229 received REQUEST_RESPONSE from 3280@/192.168.136.234
        DEBUG [RequestResponseStage:18] 2011-12-08 21:07:21,885 ResponseVerbHandler.java (line 44) Processing response on a callback from 3280@/192.168.136.234
        DEBUG [RequestResponseStage:18] 2011-12-08 21:07:21,891 AbstractRowResolver.java (line 64) Preprocessing 47 bytes
        ERROR [RequestResponseStage:18] 2011-12-08 21:07:21,903 AbstractCassandraDaemon.java (line 133) Fatal exception in thread Thread[RequestResponseStage:18,5,main]
        

        It sure looks like something fishy is going on with the message IDs – we sent 3280@ to .226, but here we are processing a response from .234. It's also worth noting that the message size (47 bytes) is neither a digest reply (21 bytes) nor the data for the read here (16460 bytes), BUT we do see other 47-byte replies in the log around this time that are replies to a Mutation sent earlier:

        DEBUG [Thread-15] 2011-12-08 21:07:21,891 IncomingTcpConnection.java (line 149) Message body is 47 bytes
        TRACE [Thread-15] 2011-12-08 21:07:21,902 MessagingService.java (line 502) /192.168.128.229 received REQUEST_RESPONSE from 3145@/192.168.128.227
        

        I think we have a race in the MessagingService ID/Callback Map somewhere.
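        The suspected mechanism can be sketched as follows. This is a hypothetical simplification, not Cassandra's actual MessagingService code (the class and method names here are illustrative): if replies are matched to callbacks by a bare integer message ID, a reply carrying a colliding ID from the wrong node gets handed to whatever callback happens to be registered under that number, and the real reply then finds the mapping already gone.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of reply dispatch keyed only by message ID.
// Names (Callback, CallbackMap) are illustrative, not Cassandra's classes.
public class CallbackMap {
    interface Callback { String apply(byte[] body); }

    private final Map<Integer, Callback> callbacks = new ConcurrentHashMap<>();

    void register(int messageId, Callback cb) { callbacks.put(messageId, cb); }

    // A reply is matched by ID alone: the sender's address is logged but not
    // checked, so a colliding ID from the wrong node hits the wrong callback,
    // and the callback is consumed (removed) in the process.
    String onReply(int messageId, byte[] body) {
        Callback cb = callbacks.remove(messageId);
        return cb == null ? "no callback (dropped)" : cb.apply(body);
    }

    public static void main(String[] args) {
        CallbackMap mm = new CallbackMap();
        // Coordinator registers ID 3280 for a digest read sent to .226;
        // a valid digest reply would be 21 bytes.
        mm.register(3280, body -> body.length == 21
                ? "digest ok" : "bad digest length " + body.length);
        // A 47-byte mutation ack arrives from .234 carrying the same ID 3280,
        // so the read callback fires against the wrong payload:
        System.out.println(mm.onReply(3280, new byte[47]));
        // The real 21-byte digest reply from .226 arrives too late:
        System.out.println(mm.onReply(3280, new byte[21]));
    }
}
```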

        Jonathan Ellis added a comment -

        Later on, we do get the reply we were expecting from .226, but by then the callback mapping has been destroyed:

        DEBUG [Thread-17] 2011-12-08 21:07:22,140 IncomingTcpConnection.java (line 149) Message body is 21 bytes
        TRACE [Thread-17] 2011-12-08 21:07:22,145 MessagingService.java (line 502) /192.168.128.229 received REQUEST_RESPONSE from 3280@/192.168.128.226
        
        Jonathan Ellis added a comment -

        There's one more instance of the 3280@ id in the log:

        TRACE [Thread-5] 2011-12-08 21:07:28,659 MessagingService.java (line 502) /192.168.128.229 received GOSSIP_DIGEST_ACK from 3280@/192.168.128.225
        

        But GOSSIP_DIGEST_ACK is a "command," not a response, i.e., that 3280 is an id generated on .225, so there's nothing abnormal there.

        Shantanu added a comment -

        Here are the logs for all the cassandra servers in the ring.
        These are the logs from the test I did today. We removed 192.168.128.229 and 192.168.136.234, since those machines were having some issues. These logs are from a fresh install of Cassandra 1.0.5 with 8 nodes.

        Jonathan Ellis added a comment -

        Brandon wrote a script (http://aep.appspot.com/display/PWH-9bzElErPO9ydpC2ay3I5Jys/) to check for the counterpart of this broken response – a request sent to .234 that did NOT get a response. But nothing like that is in the logs, everything else matches up.

        Jonathan Ellis added a comment -

        Shantanu, do you have encryption enabled?

        Shantanu added a comment -

        No, the internode encryption is set to none.

        Jonathan Ellis added a comment -

        Figured it out. This is actually a second manifestation of CASSANDRA-3577, a bug in the multi-DC write optimization. Pasting from there:

        Node A (DC1) sends a write to node B (DC2), which forwards it to node C (DC2). Node C replies to node A with the message ID it received from node B. If the message ID generation on A and B is far enough apart, A will not have a callback registered for the reply, and all you will see is a write timeout (at CL > ONE). But if A does have a callback waiting (for a different operation), A will try to apply the mutation response to that callback, which (if the callback is for a read) results in the error seen in CASSANDRA-3585.

        For 1.0.6 we've disabled that optimization; for 1.1 we've fixed it by pre-generating extra callback IDs on the coordinator (node A in this example) and forwarding those cross-DC as well.
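        The 1.1 fix described above can be sketched roughly as follows. This is an illustrative sketch, not the actual patch, and the class and method names are hypothetical: the coordinator (node A) allocates one callback ID per forward target up front and ships those IDs cross-DC along with the mutation, so every reply, including those from forwarded replicas, resolves in the coordinator's own ID space.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of pre-generating forward callback IDs on the
// coordinator. Names are hypothetical, not the actual Cassandra classes.
public class ForwardedWrite {
    private final AtomicInteger idGen = new AtomicInteger();
    private final Map<Integer, String> callbacks = new HashMap<>();

    // Coordinator (node A): register a callback ID for each replica,
    // including the ones node B will forward to on A's behalf. The
    // returned map travels cross-DC with the mutation.
    Map<String, Integer> prepareForwards(String... replicas) {
        Map<String, Integer> forwardIds = new HashMap<>();
        for (String replica : replicas) {
            int id = idGen.incrementAndGet();
            callbacks.put(id, "write-callback for " + replica);
            forwardIds.put(replica, id);
        }
        return forwardIds;
    }

    // Reply handling: every ID was minted by A, so it always resolves
    // (or was already consumed) in A's own callback map.
    String onReply(int id) { return callbacks.remove(id); }

    public static void main(String[] args) {
        ForwardedWrite a = new ForwardedWrite();
        Map<String, Integer> ids = a.prepareForwards("nodeB", "nodeC");
        // node C replies with the pre-generated ID, not one minted by B:
        System.out.println(a.onReply(ids.get("nodeC")));
    }
}
```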


          People

          • Assignee: Unassigned
          • Reporter: Shantanu
          • Votes: 0
          • Watchers: 0
