Cassandra
  1. Cassandra
  2. CASSANDRA-5668

NPE in net.OutputTcpConnection when tracing is enabled

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Fix Version/s: 1.2.6
    • Component/s: None
    • Labels:

      Description

      I get multiple NullPointerException when trying to trace INSERT statements.

      To reproduce:

      $ ccm create -v git:trunk
      $ ccm populate -n 3
      $ ccm start
      $ ccm node1 cqlsh < 5668_npe_ddl.cql
      $ ccm node1 cqlsh < 5668_npe_insert.cql
      

      And see many exceptions like this in the logs of node1:

      ERROR [WRITE-/127.0.0.3] 2013-06-19 14:54:35,885 OutboundTcpConnection.java (line 197) error writing to /127.0.0.3
      java.lang.NullPointerException
              at org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:182)
              at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:144)
      

      This is similar to CASSANDRA-5658 and is the reason that npe_ddl and npe_insert are separate files.

      1. 5668-followup.txt
        3 kB
        Jonathan Ellis
      2. 5668.txt
        22 kB
        Jonathan Ellis
      3. system.log
        41 kB
        Ryan McGuire
      4. 5668-assert-2.txt
        1 kB
        Jonathan Ellis
      5. 5668-assert.txt
        0.9 kB
        Jonathan Ellis
      6. 5668-logs.tar.gz
        8 kB
        Ryan McGuire
      7. 5668_npe_ddl.cql
        0.1 kB
        Ryan McGuire
      8. 5668_npe_insert.cql
        1 kB
        Ryan McGuire

        Issue Links

          Activity

          Hide
          Ryan McGuire added a comment -

          +1 - all my tests are passing now. Thanks!

          Show
          Ryan McGuire added a comment - +1 - all my tests are passing now. Thanks!
          Hide
          Jonathan Ellis added a comment -

          Merged to trunk.

          Show
          Jonathan Ellis added a comment - Merged to trunk.
          Hide
          Ryan McGuire added a comment -

          That works in cassandra-1.2 - if you commit to trunk, I'll test there too.

          Show
          Ryan McGuire added a comment - That works in cassandra-1.2 - if you commit to trunk, I'll test there too.
          Hide
          Jonathan Ellis added a comment -

          fix pushed in 110d283afd780774a44368b17177b5e8e781e37f

          Show
          Jonathan Ellis added a comment - fix pushed in 110d283afd780774a44368b17177b5e8e781e37f
          Hide
          Ryan McGuire added a comment -

          I've retested with the second patch, it's the same.

          Show
          Ryan McGuire added a comment - I've retested with the second patch, it's the same.
          Hide
          Jonathan Ellis added a comment -

          Did you get the followup patch in that? Should have addressed it.

          Show
          Jonathan Ellis added a comment - Did you get the followup patch in that? Should have addressed it.
          Hide
          Ryan McGuire added a comment -

          The NullPointerException is gone, however, I see an unexpected change in the trace after this commit.

          Prior to the patch, I do this:

          $ ccm create 5668
          Current cluster is now: 5668
          $ ccm populate -n 2:2
          $ ccm start
          $ ccm node1 cqlsh
          Connected to 5668 at 127.0.0.1:9160.
          [cqlsh 4.0.0 | Cassandra 2.0-SNAPSHOT | CQL spec 3.1.0 | Thrift protocol 19.37.0]
          Use HELP for help.
          cqlsh> CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1':2, 'dc2':2};
          cqlsh> CREATE TABLE test.test (id int PRIMARY KEY, value text);
          cqlsh> tracing on;
          Now tracing requests.
          cqlsh> INSERT INTO test.test (id, value) VALUES ( 5, 'asdf');
          
          Tracing session: 5d3bdaa0-da87-11e2-9eaa-35db2404c433
          
           activity                                                       | timestamp    | source    | source_elapsed
          ----------------------------------------------------------------+--------------+-----------+----------------
                                                       execute_cql3_query | 11:29:29,423 | 127.0.0.1 |              0
           Parsing INSERT INTO test.test (id, value) VALUES ( 5, 'asdf'); | 11:29:29,424 | 127.0.0.1 |           1082
                                                       Peparing statement | 11:29:29,425 | 127.0.0.1 |           2241
                                        Determining replicas for mutation | 11:29:29,425 | 127.0.0.1 |           2617
                                            Sending message to /127.0.0.2 | 11:29:29,432 | 127.0.0.1 |           9321
                                           Acquiring switchLock read lock | 11:29:29,433 | 127.0.0.1 |          10376
                                                   Appending to commitlog | 11:29:29,433 | 127.0.0.1 |          10424
                                                  Adding to test memtable | 11:29:29,433 | 127.0.0.1 |          10544
                                            Sending message to /127.0.0.3 | 11:29:29,435 | 127.0.0.1 |          12658
                                         Message received from /127.0.0.1 | 11:29:29,436 | 127.0.0.2 |            102
                                         Message received from /127.0.0.1 | 11:29:29,438 | 127.0.0.3 |            103
                                  Enqueuing forwarded write to /127.0.0.4 | 11:29:29,441 | 127.0.0.3 |           3596
                                           Acquiring switchLock read lock | 11:29:29,441 | 127.0.0.3 |           3686
                                                   Appending to commitlog | 11:29:29,441 | 127.0.0.3 |           3704
                                           Acquiring switchLock read lock | 11:29:29,442 | 127.0.0.2 |           5768
                                                  Adding to test memtable | 11:29:29,442 | 127.0.0.3 |           3759
                                                   Appending to commitlog | 11:29:29,442 | 127.0.0.2 |           5821
                                            Sending message to /127.0.0.4 | 11:29:29,442 | 127.0.0.3 |           3834
                                                  Adding to test memtable | 11:29:29,442 | 127.0.0.2 |           5896
                                         Enqueuing response to /127.0.0.1 | 11:29:29,442 | 127.0.0.3 |           4307
                                         Enqueuing response to /127.0.0.1 | 11:29:29,443 | 127.0.0.2 |           6602
                                         Message received from /127.0.0.3 | 11:29:29,444 | 127.0.0.4 |             92
                                            Sending message to /127.0.0.1 | 11:29:29,444 | 127.0.0.2 |           8037
                                         Message received from /127.0.0.2 | 11:29:29,445 | 127.0.0.1 |             36
                                      Processing response from /127.0.0.2 | 11:29:29,445 | 127.0.0.1 |            147
                                           Acquiring switchLock read lock | 11:29:29,452 | 127.0.0.4 |           8791
                                                   Appending to commitlog | 11:29:29,452 | 127.0.0.4 |           8835
                                                  Adding to test memtable | 11:29:29,452 | 127.0.0.4 |           8897
                                         Enqueuing response to /127.0.0.1 | 11:29:29,453 | 127.0.0.4 |           9542
                                            Sending message to /127.0.0.1 | 11:29:29,454 | 127.0.0.4 |          10021
                                         Message received from /127.0.0.4 | 11:29:29,454 | 127.0.0.1 |           9660
                                      Processing response from /127.0.0.4 | 11:29:29,455 | 127.0.0.1 |           9804
                                                         Request complete | 11:29:29,435 | 127.0.0.1 |          12668
          

          That's a 2x2 multi-dc cluster with a RF of 2 in each dc. I write a single INSERT and I see four nodes append to their commit log. All is well.

          After the patch, I see this trace:

          Tracing session: c3da4d00-da87-11e2-9a95-35db2404c433
          
           activity                                                       | timestamp    | source    | source_elapsed
          ----------------------------------------------------------------+--------------+-----------+----------------
                                                       execute_cql3_query | 11:32:21,587 | 127.0.0.1 |              0
           Parsing INSERT INTO test.test (id, value) VALUES ( 5, 'asdf'); | 11:32:21,587 | 127.0.0.1 |            495
                                                       Peparing statement | 11:32:21,588 | 127.0.0.1 |            962
                                        Determining replicas for mutation | 11:32:21,588 | 127.0.0.1 |           1108
                                            Sending message to /127.0.0.2 | 11:32:21,590 | 127.0.0.1 |           3341
                                            Sending message to /127.0.0.3 | 11:32:21,590 | 127.0.0.1 |           3920
                                         Message received from /127.0.0.1 | 11:32:21,592 | 127.0.0.2 |             91
                                         Message received from /127.0.0.1 | 11:32:21,593 | 127.0.0.3 |             93
                                           Acquiring switchLock read lock | 11:32:21,593 | 127.0.0.2 |           1398
                                                   Appending to commitlog | 11:32:21,593 | 127.0.0.2 |           1447
                                                  Adding to test memtable | 11:32:21,593 | 127.0.0.2 |           1517
                                  Enqueuing forwarded write to /127.0.0.4 | 11:32:21,594 | 127.0.0.3 |           1266
                                         Enqueuing response to /127.0.0.1 | 11:32:21,594 | 127.0.0.2 |           2573
                                           Acquiring switchLock read lock | 11:32:21,594 | 127.0.0.3 |           1343
                                                   Appending to commitlog | 11:32:21,594 | 127.0.0.3 |           1360
                                                  Adding to test memtable | 11:32:21,594 | 127.0.0.3 |           1412
                                         Enqueuing response to /127.0.0.1 | 11:32:21,595 | 127.0.0.3 |           2060
                                         Message received from /127.0.0.2 | 11:32:21,595 | 127.0.0.1 |           null
                                            Sending message to /127.0.0.1 | 11:32:21,595 | 127.0.0.2 |           2850
                                            Sending message to /127.0.0.1 | 11:32:21,595 | 127.0.0.3 |           2442
                                      Processing response from /127.0.0.2 | 11:32:21,595 | 127.0.0.1 |           null
                                            Sending message to /127.0.0.4 | 11:32:21,595 | 127.0.0.3 |           2442
                                         Message received from /127.0.0.3 | 11:32:21,595 | 127.0.0.1 |           null
                                      Processing response from /127.0.0.3 | 11:32:21,595 | 127.0.0.1 |           null
                                         Message received from /127.0.0.3 | 11:32:21,597 | 127.0.0.4 |             95
                                           Acquiring switchLock read lock | 11:32:21,598 | 127.0.0.4 |           1040
                                                   Appending to commitlog | 11:32:21,598 | 127.0.0.4 |           1087
                                                  Adding to test memtable | 11:32:21,598 | 127.0.0.4 |           1391
                                         Enqueuing response to /127.0.0.1 | 11:32:21,599 | 127.0.0.4 |           1885
                                         Message received from /127.0.0.4 | 11:32:21,600 | 127.0.0.1 |           null
                                            Sending message to /127.0.0.1 | 11:32:21,600 | 127.0.0.4 |           3290
                                      Processing response from /127.0.0.4 | 11:32:21,600 | 127.0.0.1 |           null
                                                         Request complete | 11:32:21,592 | 127.0.0.1 |           5564
          

          127.0.0.1 (which also happens to be the coordinator) didn't write anything to it's commit log (according to the trace at least). Reproducible on cassandra-1.2 and trunk.

          Show
          Ryan McGuire added a comment - The NullPointerException is gone, however, I see an unexpected change in the trace after this commit. Prior to the patch, I do this: $ ccm create 5668 Current cluster is now: 5668 $ ccm populate -n 2:2 $ ccm start $ ccm node1 cqlsh Connected to 5668 at 127.0.0.1:9160. [cqlsh 4.0.0 | Cassandra 2.0-SNAPSHOT | CQL spec 3.1.0 | Thrift protocol 19.37.0] Use HELP for help. cqlsh> CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1':2, 'dc2':2}; cqlsh> CREATE TABLE test.test (id int PRIMARY KEY, value text); cqlsh> tracing on; Now tracing requests. cqlsh> INSERT INTO test.test (id, value) VALUES ( 5, 'asdf'); Tracing session: 5d3bdaa0-da87-11e2-9eaa-35db2404c433 activity | timestamp | source | source_elapsed ----------------------------------------------------------------+--------------+-----------+---------------- execute_cql3_query | 11:29:29,423 | 127.0.0.1 | 0 Parsing INSERT INTO test.test (id, value) VALUES ( 5, 'asdf'); | 11:29:29,424 | 127.0.0.1 | 1082 Peparing statement | 11:29:29,425 | 127.0.0.1 | 2241 Determining replicas for mutation | 11:29:29,425 | 127.0.0.1 | 2617 Sending message to /127.0.0.2 | 11:29:29,432 | 127.0.0.1 | 9321 Acquiring switchLock read lock | 11:29:29,433 | 127.0.0.1 | 10376 Appending to commitlog | 11:29:29,433 | 127.0.0.1 | 10424 Adding to test memtable | 11:29:29,433 | 127.0.0.1 | 10544 Sending message to /127.0.0.3 | 11:29:29,435 | 127.0.0.1 | 12658 Message received from /127.0.0.1 | 11:29:29,436 | 127.0.0.2 | 102 Message received from /127.0.0.1 | 11:29:29,438 | 127.0.0.3 | 103 Enqueuing forwarded write to /127.0.0.4 | 11:29:29,441 | 127.0.0.3 | 3596 Acquiring switchLock read lock | 11:29:29,441 | 127.0.0.3 | 3686 Appending to commitlog | 11:29:29,441 | 127.0.0.3 | 3704 Acquiring switchLock read lock | 11:29:29,442 | 127.0.0.2 | 5768 Adding to test memtable | 11:29:29,442 | 127.0.0.3 | 3759 Appending to commitlog | 11:29:29,442 | 127.0.0.2 | 5821 Sending message to /127.0.0.4 | 11:29:29,442 | 127.0.0.3 | 3834 Adding to test memtable | 11:29:29,442 | 127.0.0.2 | 5896 Enqueuing response to /127.0.0.1 | 11:29:29,442 | 127.0.0.3 | 4307 Enqueuing response to /127.0.0.1 | 11:29:29,443 | 127.0.0.2 | 6602 Message received from /127.0.0.3 | 11:29:29,444 | 127.0.0.4 | 92 Sending message to /127.0.0.1 | 11:29:29,444 | 127.0.0.2 | 8037 Message received from /127.0.0.2 | 11:29:29,445 | 127.0.0.1 | 36 Processing response from /127.0.0.2 | 11:29:29,445 | 127.0.0.1 | 147 Acquiring switchLock read lock | 11:29:29,452 | 127.0.0.4 | 8791 Appending to commitlog | 11:29:29,452 | 127.0.0.4 | 8835 Adding to test memtable | 11:29:29,452 | 127.0.0.4 | 8897 Enqueuing response to /127.0.0.1 | 11:29:29,453 | 127.0.0.4 | 9542 Sending message to /127.0.0.1 | 11:29:29,454 | 127.0.0.4 | 10021 Message received from /127.0.0.4 | 11:29:29,454 | 127.0.0.1 | 9660 Processing response from /127.0.0.4 | 11:29:29,455 | 127.0.0.1 | 9804 Request complete | 11:29:29,435 | 127.0.0.1 | 12668 That's a 2x2 multi-dc cluster with a RF of 2 in each dc. I write a single INSERT and I see four nodes append to their commit log. All is well. After the patch, I see this trace: Tracing session: c3da4d00-da87-11e2-9a95-35db2404c433 activity | timestamp | source | source_elapsed ----------------------------------------------------------------+--------------+-----------+---------------- execute_cql3_query | 11:32:21,587 | 127.0.0.1 | 0 Parsing INSERT INTO test.test (id, value) VALUES ( 5, 'asdf'); | 11:32:21,587 | 127.0.0.1 | 495 Peparing statement | 11:32:21,588 | 127.0.0.1 | 962 Determining replicas for mutation | 11:32:21,588 | 127.0.0.1 | 1108 Sending message to /127.0.0.2 | 11:32:21,590 | 127.0.0.1 | 3341 Sending message to /127.0.0.3 | 11:32:21,590 | 127.0.0.1 | 3920 Message received from /127.0.0.1 | 11:32:21,592 | 127.0.0.2 | 91 Message received from /127.0.0.1 | 11:32:21,593 | 127.0.0.3 | 93 Acquiring switchLock read lock | 11:32:21,593 | 127.0.0.2 | 1398 Appending to commitlog | 11:32:21,593 | 127.0.0.2 | 1447 Adding to test memtable | 11:32:21,593 | 127.0.0.2 | 1517 Enqueuing forwarded write to /127.0.0.4 | 11:32:21,594 | 127.0.0.3 | 1266 Enqueuing response to /127.0.0.1 | 11:32:21,594 | 127.0.0.2 | 2573 Acquiring switchLock read lock | 11:32:21,594 | 127.0.0.3 | 1343 Appending to commitlog | 11:32:21,594 | 127.0.0.3 | 1360 Adding to test memtable | 11:32:21,594 | 127.0.0.3 | 1412 Enqueuing response to /127.0.0.1 | 11:32:21,595 | 127.0.0.3 | 2060 Message received from /127.0.0.2 | 11:32:21,595 | 127.0.0.1 | null Sending message to /127.0.0.1 | 11:32:21,595 | 127.0.0.2 | 2850 Sending message to /127.0.0.1 | 11:32:21,595 | 127.0.0.3 | 2442 Processing response from /127.0.0.2 | 11:32:21,595 | 127.0.0.1 | null Sending message to /127.0.0.4 | 11:32:21,595 | 127.0.0.3 | 2442 Message received from /127.0.0.3 | 11:32:21,595 | 127.0.0.1 | null Processing response from /127.0.0.3 | 11:32:21,595 | 127.0.0.1 | null Message received from /127.0.0.3 | 11:32:21,597 | 127.0.0.4 | 95 Acquiring switchLock read lock | 11:32:21,598 | 127.0.0.4 | 1040 Appending to commitlog | 11:32:21,598 | 127.0.0.4 | 1087 Adding to test memtable | 11:32:21,598 | 127.0.0.4 | 1391 Enqueuing response to /127.0.0.1 | 11:32:21,599 | 127.0.0.4 | 1885 Message received from /127.0.0.4 | 11:32:21,600 | 127.0.0.1 | null Sending message to /127.0.0.1 | 11:32:21,600 | 127.0.0.4 | 3290 Processing response from /127.0.0.4 | 11:32:21,600 | 127.0.0.1 | null Request complete | 11:32:21,592 | 127.0.0.1 | 5564 127.0.0.1 (which also happens to be the coordinator) didn't write anything to it's commit log (according to the trace at least). Reproducible on cassandra-1.2 and trunk.
          Hide
          Jonathan Ellis added a comment -

          committed.

          Show
          Jonathan Ellis added a comment - committed.
          Hide
          Sylvain Lebresne added a comment -

          +1 on that follow up.

          Show
          Sylvain Lebresne added a comment - +1 on that follow up.
          Hide
          Jonathan Ellis added a comment -

          followup patch to make session cleanup check for REQUEST_RESPONSE instead of local-ness, to keep from destroying session prematurely in RMVH forwarding case.

          Show
          Jonathan Ellis added a comment - followup patch to make session cleanup check for REQUEST_RESPONSE instead of local-ness, to keep from destroying session prematurely in RMVH forwarding case.
          Hide
          Jonathan Ellis added a comment -

          Fixed and committed.

          Show
          Jonathan Ellis added a comment - Fixed and committed.
          Hide
          Sylvain Lebresne added a comment -

          I don't think this patch can be committed as is: in both TracingExecutorService and ExpiredTracingState, the package declaration is before the license header, and I can't let that slide.

          But with that fixed, +1.

          Show
          Sylvain Lebresne added a comment - I don't think this patch can be committed as is: in both TracingExecutorService and ExpiredTracingState, the package declaration is before the license header, and I can't let that slide. But with that fixed, +1.
          Hide
          Jonathan Ellis added a comment -

          Good idea to check for REQUEST_RESPONSE, although it's not quite as easy as it sounds since we still need to be able to inject the TraceState into the executor stage. Patch attached.

          (Note that once the session is closed we won't know elapsed time anymore. I don't see a good way around this.)

          Show
          Jonathan Ellis added a comment - Good idea to check for REQUEST_RESPONSE, although it's not quite as easy as it sounds since we still need to be able to inject the TraceState into the executor stage. Patch attached. (Note that once the session is closed we won't know elapsed time anymore. I don't see a good way around this.)
          Hide
          Sylvain Lebresne added a comment -

          For what it's worth, I think that for the 2nd problem, another option might be to make Tracing.initializeMessage behave slightly differently depending on the message type. So if the state doesn't exist but the message type is a REQUEST_RESPONSE, we could create the state and set it in the threadLocal, but not save it in the global state map.

          It's a bit of a hack though, but it slightly bother me to leave this to expiration either so ....

          Show
          Sylvain Lebresne added a comment - For what it's worth, I think that for the 2nd problem, another option might be to make Tracing.initializeMessage behave slightly differently depending on the message type. So if the state doesn't exist but the message type is a REQUEST_RESPONSE, we could create the state and set it in the threadLocal, but not save it in the global state map. It's a bit of a hack though, but it slightly bother me to leave this to expiration either so ....
          Hide
          Jonathan Ellis added a comment - - edited

          Okay, here's what's happening. I added logging of session cleanup:

           INFO [Thrift:1] 2013-06-19 23:36:51,719 Tracing.java (line 176) session 0702a620-d963-11e2-832d-53376523a4a2 is complete
          
          java.lang.AssertionError: Asked to trace TYPE:MUTATION VERB:MUTATION for session 0702a620-d963-11e2-832d-53376523a4a2 but that state does not exist
          

          cqlsh is requesting QUORUM CL (or ONE?) so once that's achieved the coordinator sends success to the client and closes the tracing session.

          if other messages have not yet gone out, then we error.

          But it gets worse...

          Once the coordinator's state is discarded, any late-arriving replies will create a new, "non-local" session. Since the coordinator will not send any messages again for this session – which is the trigger we use on replicas to indicate "we're done" – the nonlocal session will persist indefinitely, "leaking" memory.

          I think we can solve both of these:

          1. Make a static TraceState method that only needs the sessionid to be passed in to log an event. OTC can use this to avoid having to look up tracestate at all; if it's cleared out, not a problem.
          2. Make Tracing.sessions an expiring map so sessions we don't clean up manually still get removed

          Alternatively we could just go with #2 by itself and not try to cleanup manually at all. Average case memory used will be worse, but maybe that is okay since we assume only a tiny fraction of requests are traced at all.

          What do you think Sylvain Lebresne?

          Show
          Jonathan Ellis added a comment - - edited Okay, here's what's happening. I added logging of session cleanup: INFO [Thrift:1] 2013-06-19 23:36:51,719 Tracing.java (line 176) session 0702a620-d963-11e2-832d-53376523a4a2 is complete java.lang.AssertionError: Asked to trace TYPE:MUTATION VERB:MUTATION for session 0702a620-d963-11e2-832d-53376523a4a2 but that state does not exist cqlsh is requesting QUORUM CL (or ONE?) so once that's achieved the coordinator sends success to the client and closes the tracing session. if other messages have not yet gone out, then we error. But it gets worse... Once the coordinator's state is discarded, any late-arriving replies will create a new, "non-local" session. Since the coordinator will not send any messages again for this session – which is the trigger we use on replicas to indicate "we're done" – the nonlocal session will persist indefinitely, "leaking" memory. I think we can solve both of these: Make a static TraceState method that only needs the sessionid to be passed in to log an event. OTC can use this to avoid having to look up tracestate at all; if it's cleared out, not a problem. Make Tracing.sessions an expiring map so sessions we don't clean up manually still get removed Alternatively we could just go with #2 by itself and not try to cleanup manually at all. Average case memory used will be worse, but maybe that is okay since we assume only a tiny fraction of requests are traced at all. What do you think Sylvain Lebresne ?
          Hide
          Jonathan Ellis added a comment - - edited

          Are we actually creating multiple sessions?

          Edit: Yes.

          Show
          Jonathan Ellis added a comment - - edited Are we actually creating multiple sessions? Edit: Yes.
          Hide
          Jonathan Ellis added a comment -

          So, .2 is sending two messages for this session, and the first is deleting the session when it's done. (We know that the session is getting created correctly since the first patch, that checks at message send time, works fine.)

          The part I don't understand is, why two messages from .2? There should only be one (responding to .1).

          (This could be problematic for cross-dc replication, although so far tracing seems to be working for CASSANDRA-5632. But for same-dc, one message per replica should be straightforward.)

          Show
          Jonathan Ellis added a comment - So, .2 is sending two messages for this session, and the first is deleting the session when it's done. (We know that the session is getting created correctly since the first patch, that checks at message send time, works fine.) The part I don't understand is, why two messages from .2? There should only be one (responding to .1). (This could be problematic for cross-dc replication, although so far tracing seems to be working for CASSANDRA-5632 . But for same-dc, one message per replica should be straightforward.)
          Hide
          Ryan McGuire added a comment - - edited

          OK, that prodcued the assertion errors. See attached system.log

          Also, I saw a lot of this in the cqlsh terminal:

          <stdin>:9:Request did not complete within rpc_timeout.
          <stdin>:10:Request did not complete within rpc_timeout.
          <stdin>:11:'NoneType' object is not iterable
          <stdin>:12:Request did not complete within rpc_timeout.
          <stdin>:13:Request did not complete within rpc_timeout.
          <stdin>:14:'NoneType' object is not iterable
          <stdin>:15:'NoneType' object is not iterable
          <stdin>:16:Request did not complete within rpc_timeout.
          <stdin>:17:'NoneType' object is not iterable
          <stdin>:18:Request did not complete within rpc_timeout.
          <stdin>:19:Request did not complete within rpc_timeout.
          <stdin>:20:'NoneType' object is not iterable
          <stdin>:21:'NoneType' object is not iterable
          <stdin>:22:'NoneType' object is not iterable
          
          Show
          Ryan McGuire added a comment - - edited OK, that prodcued the assertion errors. See attached system.log Also, I saw a lot of this in the cqlsh terminal: <stdin>:9:Request did not complete within rpc_timeout. <stdin>:10:Request did not complete within rpc_timeout. <stdin>:11:'NoneType' object is not iterable <stdin>:12:Request did not complete within rpc_timeout. <stdin>:13:Request did not complete within rpc_timeout. <stdin>:14:'NoneType' object is not iterable <stdin>:15:'NoneType' object is not iterable <stdin>:16:Request did not complete within rpc_timeout. <stdin>:17:'NoneType' object is not iterable <stdin>:18:Request did not complete within rpc_timeout. <stdin>:19:Request did not complete within rpc_timeout. <stdin>:20:'NoneType' object is not iterable <stdin>:21:'NoneType' object is not iterable <stdin>:22:'NoneType' object is not iterable
          Hide
          Jonathan Ellis added a comment -

          Hmm. Take 2.

          Show
          Jonathan Ellis added a comment - Hmm. Take 2.
          Hide
          Ryan McGuire added a comment -

          I ran with the patch, but I did not see any assertion errors.

          Show
          Ryan McGuire added a comment - I ran with the patch, but I did not see any assertion errors.
          Hide
          Jonathan Ellis added a comment -

          attached additional assert to see where we are trying to trace without setting up the session first

          Show
          Jonathan Ellis added a comment - attached additional assert to see where we are trying to trace without setting up the session first
          Hide
          Ryan McGuire added a comment -

          This is what one the traces looks like that failed:

          127.0.0.1	78	Parsing statement	Thrift:1
          127.0.0.1	606	Peparing statement	Thrift:1
          127.0.0.1	949	Determining replicas for mutation	Thrift:1
          127.0.0.1	3577	Acquiring switchLock read lock	MutationStage:1
          127.0.0.1	3622	Appending to commitlog	MutationStage:1
          127.0.0.1	3754	Adding to test memtable	MutationStage:1
          127.0.0.1	4936	Sending message to /127.0.0.2	WRITE-/127.0.0.2
          127.0.0.2	37	Message received from /127.0.0.1	Thread-4
          127.0.0.2	562	Acquiring switchLock read lock	MutationStage:1
          127.0.0.2	583	Appending to commitlog	MutationStage:1
          127.0.0.1	32	Message received from /127.0.0.2	Thread-7
          127.0.0.2	634	Adding to test memtable	MutationStage:1
          127.0.0.2	1033	Enqueuing response to /127.0.0.1	MutationStage:1
          127.0.0.2	1217	Sending message to /127.0.0.1	WRITE-/127.0.0.1
          127.0.0.1	132	Processing response from /127.0.0.2	RequestResponseStage:3
          

          With a replication factor of 3 it should be writing to all three nodes (127.0.0.1, 127.0.0.2, and 127.0.0.3) - 127.0.0.3 is conspicuously missing from that trace.

          Show
          Ryan McGuire added a comment - This is what one the traces looks like that failed: 127.0.0.1 78 Parsing statement Thrift:1 127.0.0.1 606 Peparing statement Thrift:1 127.0.0.1 949 Determining replicas for mutation Thrift:1 127.0.0.1 3577 Acquiring switchLock read lock MutationStage:1 127.0.0.1 3622 Appending to commitlog MutationStage:1 127.0.0.1 3754 Adding to test memtable MutationStage:1 127.0.0.1 4936 Sending message to /127.0.0.2 WRITE-/127.0.0.2 127.0.0.2 37 Message received from /127.0.0.1 Thread -4 127.0.0.2 562 Acquiring switchLock read lock MutationStage:1 127.0.0.2 583 Appending to commitlog MutationStage:1 127.0.0.1 32 Message received from /127.0.0.2 Thread -7 127.0.0.2 634 Adding to test memtable MutationStage:1 127.0.0.2 1033 Enqueuing response to /127.0.0.1 MutationStage:1 127.0.0.2 1217 Sending message to /127.0.0.1 WRITE-/127.0.0.1 127.0.0.1 132 Processing response from /127.0.0.2 RequestResponseStage:3 With a replication factor of 3 it should be writing to all three nodes (127.0.0.1, 127.0.0.2, and 127.0.0.3) - 127.0.0.3 is conspicuously missing from that trace.
          Hide
          Ryan McGuire added a comment - - edited

          I've tried to introduce sleeps in between INSERTs, even with 30s pauses between them I still see the same in the logs. I have also seen this fail on the very first INSERT.

          Show
          Ryan McGuire added a comment - - edited I've tried to introduce sleeps in between INSERTs, even with 30s pauses between them I still see the same in the logs. I have also seen this fail on the very first INSERT.

            People

            • Assignee:
              Jonathan Ellis
              Reporter:
              Ryan McGuire
              Reviewer:
              Sylvain Lebresne
              Tester:
              Ryan McGuire
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development