Apache Cassandra / CASSANDRA-15700

Performance regression on internode messaging


Details

    • Degradation - Performance Bug/Regression
    • Normal
    • Normal
    • Performance Regression Test
    • All
    • None
    • See attached performance test plots. Also see unit tests.

    Description

      jasonstack and I have been investigating a performance regression affecting 4.0 during a 3-node, RF 3 write throughput test with a timeseries-like workload, as shown in the attached plot (Oss40vsOss311.png), where blue is 3.11 and orange is 4.0:

      It's been a bit of a long investigation, but two clues ended up standing out:
      1) An abnormal number of expired messages on 4.0 (as shown in the attached system log), while 3.11 has almost none.
      2) Abnormal GC activity (as shown in the attached gc log).

      It turns out the two are related, as the onExpired callback creates a huge number of strings in the id() call. The next question is what causes all those message expirations; we thoroughly reviewed the internode messaging code, and the only issue we could find so far is related to the "batch pruning" calls here and here: it seems too much time is spent on those, causing the event loop to fall behind in processing the rest of the messages, which then end up being expired (see the allocation sketch after the stack samples below). This is supported by the analysis of the collapsed stacks (after fixing the GC issue):

      (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
      org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
      org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
      org/apache/cassandra/net/PrunableArrayQueue:prune 1621
      org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
      org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
      

      Those are the top 5 sampled calls from EventLoopDelivery#doRun(), which spends half of its time pruning. But only a tiny portion of that pruning time is spent actually expiring:

      (tprint (top-aggregated-calls oss40nogc "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
      org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
      org/apache/cassandra/net/PrunableArrayQueue:prune 1894
      org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
      org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
      org/apache/cassandra/net/OutboundConnection:onExpired 147
      

      And indeed, the PrunableArrayQueue:prune() self time is dominant:

      (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
      org/apache/cassandra/net/PrunableArrayQueue:prune 1718
      org/apache/cassandra/net/OutboundConnection:releaseCapacity 27
      java/util/concurrent/ConcurrentHashMap:replaceNode 19
      java/util/concurrent/ConcurrentLinkedQueue:offer 16
      java/util/concurrent/LinkedBlockingQueue:offer 15
      
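      For illustration, the kind of per-message allocation in the expiry callback that can drive the GC pressure above might look like the following sketch. It is hypothetical (ExpirySketch, maybeLog and the two onExpired* methods are made-up names, not Cassandra's actual classes): the point is only that building the id string eagerly for every expired message allocates a new String per drop, even when the corresponding log line is rate-limited away:

      import java.util.function.Supplier;

      final class ExpirySketch
      {
          // Stand-in for a rate-limited logger: emits at most one line per second.
          private static long lastLogNanos = 0;

          static void maybeLog(Supplier<String> message)
          {
              long now = System.nanoTime();
              if (now - lastLogNanos >= 1_000_000_000L)
              {
                  lastLogNanos = now;
                  System.out.println(message.get());   // the string is built only here
              }
          }

          // Allocation-heavy pattern: the id string is built for every expired
          // message, even though almost all of them are never logged.
          static void onExpiredEager(long messageId)
          {
              String id = "messageId=" + messageId;    // one new String per expiry
              maybeLog(() -> id);
          }

          // Cheaper pattern: defer string construction until the logger actually
          // emits the line.
          static void onExpiredLazy(long messageId)
          {
              maybeLog(() -> "messageId=" + messageId);
          }
      }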

      That said, before proceeding with a PR to fix those issues, I'd like to understand: what's the reason to prune so often, rather than just when polling the message during delivery? If there's a reason I'm missing, let's talk about how to optimize pruning; otherwise, let's get rid of it.
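      To make that question concrete, a minimal sketch of the poll-time alternative is below, assuming a simple single-threaded queue. ExpireOnPollQueue and Entry are hypothetical stand-ins, not Cassandra's PrunableArrayQueue or OutboundMessageQueue API; the idea is just that expiration is detected lazily, on the entries the delivery loop actually reaches, rather than in a separate prune pass on the event loop:

      import java.util.ArrayDeque;
      import java.util.Queue;
      import java.util.function.Consumer;

      final class ExpireOnPollQueue<T>
      {
          private static final class Entry<P>
          {
              final P payload;
              final long expiresAtNanos;

              Entry(P payload, long expiresAtNanos)
              {
                  this.payload = payload;
                  this.expiresAtNanos = expiresAtNanos;
              }
          }

          private final Queue<Entry<T>> queue = new ArrayDeque<>();
          private final Consumer<T> onExpired;

          ExpireOnPollQueue(Consumer<T> onExpired)
          {
              this.onExpired = onExpired;
          }

          void add(T payload, long expiresAtNanos)
          {
              queue.add(new Entry<>(payload, expiresAtNanos));
          }

          // Expired entries are dropped here, as delivery reaches them, instead of
          // in a separate prune pass over the whole queue.
          T poll(long nowNanos)
          {
              Entry<T> head;
              while ((head = queue.poll()) != null)
              {
                  if (head.expiresAtNanos > nowNanos)
                      return head.payload;
                  onExpired.accept(head.payload);      // expired: notify and keep polling
              }
              return null;
          }
      }

      The obvious trade-off is that expired messages would linger in the queue (and hold whatever capacity they account for) until delivery reaches them, which may well be the reason for the eager pruning; that is exactly what the question above is trying to establish.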

      Attachments

        1. Oss40vsOss311.png
          56 kB
          Sergio Bossa
        2. Oss40patchedvsOss311.png
          42 kB
          Sergio Bossa
        3. oss40.gc
          593 kB
          Sergio Bossa
        4. oss40_system.log
          1.41 MB
          Sergio Bossa
        5. oss40_nogc.tar.xz
          680 kB
          Sergio Bossa


            People

              Assignee: sbtourist (Sergio Bossa)
              Reporter: sbtourist (Sergio Bossa)
              Authors: Sergio Bossa
              Reviewers: Aleksey Yeschenko
              Votes: 0
              Watchers: 11

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 20m