Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.2.6
    • Component/s: Core
    • Labels: None

      Description

      We group messages by destination as follows to avoid sending multiple messages to a remote datacenter:

              // Multimap that holds onto all the messages and addresses meant for a specific datacenter
              Map<String, Multimap<Message, InetAddress>> dcMessages
      

      When we cleaned out the MessageProducer stuff for 2.0, this code

                          Multimap<Message, InetAddress> messages = dcMessages.get(dc);
      ...
                          messages.put(producer.getMessage(Gossiper.instance.getVersion(destination)), destination);
      

      turned into

                          Multimap<MessageOut, InetAddress> messages = dcMessages.get(dc);
      ...
                          messages.put(rm.createMessage(), destination);
      

      Thus, we weren't actually grouping anything anymore – each destination replica was stored under a separate Message key, unlike under the old CachingMessageProducer.
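
      For illustration, a minimal sketch of the intended grouping (MessageOut and datacenterOf() here are simplified stand-ins, not the real Cassandra classes): the message is created once per mutation and reused as the Multimap key, so all replicas in a datacenter group under a single message.

              import java.net.InetAddress;
              import java.util.HashMap;
              import java.util.Map;

              import com.google.common.collect.HashMultimap;
              import com.google.common.collect.Multimap;

              class DcGroupingSketch
              {
                  static class MessageOut { }

                  // Placeholder for the snitch lookup that maps an endpoint to its DC.
                  static String datacenterOf(InetAddress endpoint)
                  {
                      return "DC1";
                  }

                  static Map<String, Multimap<MessageOut, InetAddress>> group(MessageOut message,
                                                                              Iterable<InetAddress> targets)
                  {
                      Map<String, Multimap<MessageOut, InetAddress>> dcMessages =
                          new HashMap<String, Multimap<MessageOut, InetAddress>>();
                      for (InetAddress destination : targets)
                      {
                          String dc = datacenterOf(destination);
                          Multimap<MessageOut, InetAddress> messages = dcMessages.get(dc);
                          if (messages == null)
                          {
                              messages = HashMultimap.create();
                              dcMessages.put(dc, messages);
                          }
                          // Reusing the same message instance keeps one key per mutation;
                          // the regression called rm.createMessage() per destination, which
                          // created a distinct key (and a separate cross-DC send) per replica.
                          messages.put(message, destination);
                      }
                      return dcMessages;
                  }
              }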

      Attachments

      1. 5632.txt
        5 kB
        Jonathan Ellis
      2. cassandra-topology.properties
        0.9 kB
        Hayato Shimizu
      3. fix_patch_bug.log
        4 kB
        Hayato Shimizu
      4. 5632-v2.txt
        9 kB
        Jonathan Ellis

          Activity

          Jonathan Ellis added a comment -

          Backport of 62f429337caf0aa83b68720a5904e8527b840c80 from 2.0 to fix this.

          Hayato Shimizu added a comment -

          Tested; the patch fixed the issue.

          Hayato Shimizu added a comment - edited

          The patch fixes the bandwidth-saving issue.

          However, two regressions appear to have been introduced:

          1. DC2 coordinator selection by the DC1 coordinator is not balanced across the available nodes in DC2; some nodes in DC2 are never used as coordinators.
          2. When using cqlsh with EACH_QUORUM/ALL and tracing on, a row insert hits an RPC timeout from a node that cannot be identified in the trace output.

          Trace output is attached for a 6-node cluster with a DC1:3, DC2:3 replication factor configuration. The network-topology configuration is also attached for clarity.

          Jonathan Ellis added a comment -

          "Secondary DC coordinator node is always the same node. This introduces a bottleneck in the secondary DC."

          It's the same node for a given token range; when all token ranges are considered, the load is evenly spread (see the sketch at the end of this comment).

          "RPC timeout occurs from a node that is not verifiable in the trace output."

          Well, that's not a very useful error message, is it?
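
          (Illustration only, not Cassandra's code: a minimal sketch of why the forwarder is fixed per range but spread across ranges. replicasFor() is a hypothetical stand-in for the replication-strategy lookup.)

              import java.net.InetAddress;
              import java.util.List;

              class ForwarderSketch
              {
                  // Hypothetical lookup: the replicas a remote DC has for one token
                  // range, in the stable order produced by the replication strategy.
                  static List<InetAddress> replicasFor(String tokenRange, String remoteDc)
                  {
                      throw new UnsupportedOperationException("stand-in for strategy lookup");
                  }

                  // Deterministic for a given range (always that range's first remote
                  // replica) but different from range to range, so forwarding duty
                  // evens out across the whole ring.
                  static InetAddress forwarderFor(String tokenRange, String remoteDc)
                  {
                      return replicasFor(tokenRange, remoteDc).get(0);
                  }
              }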

          Jonathan Ellis added a comment -

          .55 is the forwarding node in DC2. It logs that it applies the mutation and acks it:

              Enqueuing response to /192.168.56.50 | 05:57:33,825 | 192.168.56.55 |          14785
          

          But there is no "Processing response from /192.168.56.55" line logged by .50. Hmm.

          Jonathan Ellis added a comment -

          You're not running with cross_node_timeout enabled, are you? Because some of these clocks are minutes apart.

          # Enable operation timeout information exchange between nodes to accurately
          # measure request timeouts. If disabled, Cassandra will assume the request
          # was forwarded to the replica instantly by the coordinator.
          #
          # Warning: before enabling this property make sure NTP is installed
          # and the times are synchronized between the nodes.
          cross_node_timeout: false
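
          For context, a minimal sketch (hypothetical names, not the real MessagingService code) of what the setting changes: with cross_node_timeout enabled, the replica measures a request's age from the coordinator's timestamp, so clock skew of minutes makes messages look expired on arrival and they are dropped before processing.

              // Sketch of the dropped-message decision on the receiving node.
              class CrossNodeTimeoutSketch
              {
                  static boolean shouldDrop(long constructionTimeMillis, long timeoutMillis,
                                            boolean crossNodeTimeout)
                  {
                      long age = crossNodeTimeout
                               ? System.currentTimeMillis() - constructionTimeMillis // trusts the sender's clock
                               : 0; // disabled: assume instant delivery from the coordinator
                      return age > timeoutMillis;
                  }
              }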
          
          Jonathan Ellis added a comment - edited

          I note that .55 doesn't ever log "Sending message" to .50 either. So the message is getting dropped somewhere inside .55's MessagingService.

          cross_node_timeout is my best guess. Next-best guess is that there's a reconnect somehow dropping the message a la CASSANDRA-5393.

          Jonathan Ellis added a comment -

          v2 attached that rebases and does some further cleanup to improve trace messages.

          Dave Brosius added a comment -

          Other than one simple fat-finger in the patch, LGTM:

          -import org.apache.cassandra.tracing.Tracing;
          +import org.apache.cassandra.tracing.4Tracing;

          Jonathan Ellis added a comment -

          Committed. That should give Hayato an easier way to test at least.

          Hayato Shimizu added a comment -

          It seems that issue 1 from my earlier comment was already fixed in 1.2.5 by Yuki (CASSANDRA-5424), which changed the replicas set in NetworkTopologyStrategy.calculateNaturalEndpoints from a HashSet (as in 1.2.4) to a LinkedHashSet, so please ignore it.
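
          (A toy illustration of the ordering difference, not the actual strategy code: HashSet iteration order is an artifact of hashing, while LinkedHashSet preserves insertion order, so the endpoint list, and hence which node comes first for a given range, is well-defined.)

              import java.util.HashSet;
              import java.util.LinkedHashSet;
              import java.util.Set;

              class EndpointOrderSketch
              {
                  public static void main(String[] args)
                  {
                      Set<String> hashed = new HashSet<String>();
                      Set<String> linked = new LinkedHashSet<String>();
                      for (String ep : new String[] { "192.168.56.55", "192.168.56.54", "192.168.56.56" })
                      {
                          hashed.add(ep);
                          linked.add(ep);
                      }
                      System.out.println(hashed); // order depends on hashCode bucketing
                      System.out.println(linked); // insertion (ring walk) order preserved
                  }
              }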

          Ryan McGuire added a comment -

          I've written a dtest that automates the testing of this issue.

          The test clearly shows that the coordinator was talking to more than one node in the other datacenter, and that the patch resolves the issue. It also verifies that the behavior Hayato Shimizu described, always using the same forwarder, is no longer happening.

          Jonathan Ellis - I noticed in your blog post about tracing that you said not to rely on the activity field; well, that's exactly what I'm doing here. So, +1 to the idea of making those enums so this doesn't break in the future.
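
          (A rough sketch of the kind of check such a test can do, using the DataStax Java driver's 2.0-style API against system_traces. The contact point and the trace session id passed in args[0] are placeholders, and matching on the activity text is exactly the fragility mentioned above.)

              import java.net.InetAddress;
              import java.util.HashSet;
              import java.util.Set;
              import java.util.UUID;

              import com.datastax.driver.core.Cluster;
              import com.datastax.driver.core.Row;
              import com.datastax.driver.core.Session;

              class TraceCheckSketch
              {
                  public static void main(String[] args)
                  {
                      UUID traceId = UUID.fromString(args[0]); // session id of a traced insert
                      Cluster cluster = Cluster.builder().addContactPoint("192.168.56.50").build();
                      Session session = cluster.connect();
                      Set<InetAddress> senders = new HashSet<InetAddress>();
                      for (Row row : session.execute(
                              "SELECT source, activity FROM system_traces.events WHERE session_id = " + traceId))
                      {
                          // Nodes that emit "Sending message" events are doing cross-node
                          // sends; after the fix, only one node per remote DC should show
                          // up here for a single mutation.
                          if (row.getString("activity").contains("Sending message"))
                              senders.add(row.getInet("source"));
                      }
                      System.out.println("nodes that sent messages: " + senders);
                      cluster.close();
                  }
              }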

          Jonathan Ellis added a comment -

          Thanks, Ryan. Go ahead and create a ticket for that and I'll put my next junior hire on it.

          Robert Coli added a comment -

          Do you have an "affects" version for this issue? The description says the bug was introduced during the rewrite for 2.0, but it affects 1.2.x, so I'm confused.

          Jeremy Hanna added a comment -

          I believe the issue that was fixed here originated in 1.2 and was present up through 1.2.5.


            People

            • Assignee:
              Jonathan Ellis
              Reporter:
              Jonathan Ellis
              Reviewer:
              Dave Brosius
              Tester:
              Hayato Shimizu
            • Votes:
              0
              Watchers:
              7

              Dates

              • Created:
                Updated:
                Resolved:
