Cassandra / CASSANDRA-15700

Performance regression on internode messaging



    • Degradation - Performance Bug/Regression
    • Normal
    • Normal
    • Performance Regression Test
    • All
    • None

      See attached performance test plots. Also see unit tests.


      jasonstack and I have been investigating a performance regression affecting 4.0 during a 3-node, RF=3 write-throughput test with a timeseries-like workload, as shown in the attached plot (Oss40vsOss311.png), where blue is 3.11 and orange is 4.0.

      It's been a bit of a long investigation, but two clues ended up standing out:
      1) An abnormal number of expired messages on 4.0 (as shown in the attached system log), while 3.11 has almost none.
      2) Abnormal GC activity (as shown in the attached gc log).

      Turns out the two are related, as the on-expired callback creates a huge number of strings in the id() call. The next question is what causes all those message expirations. We thoroughly reviewed the internode messaging code, and the only issue we could find so far is related to the "batch pruning" calls here and here: it seems too much time is spent on those, causing the event loop to fall behind in processing the rest of the messages, which then end up expired. This is supported by the analysis of the collapsed stacks (after fixing the GC issue):

      (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
      org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
      org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
      org/apache/cassandra/net/PrunableArrayQueue:prune 1621
      org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
      org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620

      Those are the top 5 sampled calls from EventLoopDelivery#doRun(), which spends roughly half of its time pruning (1621 of 3456 samples). But only a tiny portion of that pruning time is spent actually expiring:

      (tprint (top-aggregated-calls oss40nogc "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
      org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
      org/apache/cassandra/net/PrunableArrayQueue:prune 1894
      org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
      org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
      org/apache/cassandra/net/OutboundConnection:onExpired 147
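
      For context on the GC side of this, the allocation storm came from eagerly building the id string for every expired message. A minimal sketch of the fix direction, deferring that work until a log line is actually emitted; these names are invented for illustration and are not the actual Cassandra API:

```java
// Hypothetical sketch, not the actual Cassandra classes: the point is that
// the expensive id string is only built when verbose logging is enabled,
// so the common expiration path allocates nothing.
final class ExpiredCallbackSketch
{
    static final class Msg
    {
        final long id;
        Msg(long id) { this.id = id; }
        // The string is built lazily, only when a caller really needs it
        String idString() { return "msg-" + id; }
    }

    long expiredCount = 0;

    void onExpired(Msg m, boolean verbose)
    {
        expiredCount++;                       // cheap bookkeeping, no allocation
        if (verbose)                          // allocate only on the slow path
            System.out.println("expired " + m.idString());
    }
}
```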

      And indeed, the PrunableArrayQueue:prune() self time is dominant:

      (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
      org/apache/cassandra/net/PrunableArrayQueue:prune 1718
      org/apache/cassandra/net/OutboundConnection:releaseCapacity 27
      java/util/concurrent/ConcurrentHashMap:replaceNode 19
      java/util/concurrent/ConcurrentLinkedQueue:offer 16
      java/util/concurrent/LinkedBlockingQueue:offer 15
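
      To illustrate why prune() self time dominates when it runs this often: pruning an array-backed queue is a full compacting pass over the backlog, so each call costs O(n) in the queue size. A rough sketch of such a pass, invented for illustration rather than taken from the real PrunableArrayQueue:

```java
// Rough sketch of a compacting prune over an array-backed queue. Invented
// code, not the real PrunableArrayQueue: it only illustrates that every
// prune call walks the entire backlog.
final class PruneSketch
{
    static int prune(long[] expiresAt, Object[] items, int size, long now)
    {
        int kept = 0;
        for (int i = 0; i < size; i++)
        {
            if (expiresAt[i] > now)           // still live: shift left
            {
                expiresAt[kept] = expiresAt[i];
                items[kept] = items[i];
                kept++;
            }
        }
        for (int i = kept; i < size; i++)
            items[i] = null;                  // drop references for GC
        return kept;                          // new queue size
    }
}
```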

      That said, before proceeding with a PR to fix those issues, I'd like to understand: what's the reason to prune so often, rather than just when polling messages during delivery? If there's a reason I'm missing, let's talk about how to optimize pruning; otherwise, let's get rid of it.
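
      For reference, the expire-on-poll alternative hinted at above could look roughly like this; an invented sketch, not a proposed patch:

```java
import java.util.ArrayDeque;

// Invented sketch of expire-on-poll: expired entries are dropped as the
// delivery loop polls, amortising the cost over delivery instead of paying
// for whole-queue prune passes on the event loop.
final class LazyExpiringQueue
{
    static final class Entry
    {
        final String payload;
        final long expiresAt;
        Entry(String payload, long expiresAt) { this.payload = payload; this.expiresAt = expiresAt; }
    }

    private final ArrayDeque<Entry> queue = new ArrayDeque<>();
    long expired = 0;

    void offer(String payload, long expiresAt) { queue.add(new Entry(payload, expiresAt)); }

    // Skips and counts expired entries; returns the next live payload or null.
    String poll(long now)
    {
        Entry e;
        while ((e = queue.poll()) != null)
        {
            if (e.expiresAt > now)
                return e.payload;
            expired++;                        // the on-expired callback would run here
        }
        return null;
    }
}
```

      The trade-off, of course, is that expired messages linger in the queue (and keep holding capacity) until delivery reaches them, which may be exactly why eager pruning exists.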


        Attachments

        1. Oss40vsOss311.png
          56 kB
          Sergio Bossa
        2. oss40.gc
          593 kB
          Sergio Bossa
        3. oss40_system.log
          1.41 MB
          Sergio Bossa
        4. oss40_nogc.tar.xz
          680 kB
          Sergio Bossa
        5. Oss40patchedvsOss311.png
          42 kB
          Sergio Bossa


              Assignee: Sergio Bossa (sbtourist)
              Reporter: Sergio Bossa (sbtourist)
              Authors: Sergio Bossa
              Reviewers: Aleksey Yeschenko



                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 20m