Cassandra / CASSANDRA-15464

Inserts to set<text> slow due to AtomicBTreePartition for ComplexColumnData.dataSize


    Details

    • Type: Bug
    • Status: Open
    • Priority: Normal
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: Legacy/Core
    • Labels:
      None
    • Bug Category:
      Degradation - Performance Bug/Regression
    • Severity:
      Normal
    • Complexity:
      Normal
    • Discovered By:
      User Report
    • Platform:
      All
    • Impacts:
      None

      Description

      Concurrent inserts to set<text> can cause client timeouts and excessive CPU due to the compare-and-swap loop in AtomicBTreePartition calling ComplexColumnData.dataSize. As the set gets longer, the work done between successive compares grows, so the probability that any given compare-and-swap attempt succeeds decreases.

      The problem we saw in production was with insertions into a set<text> whose length ran from hundreds to thousands of elements. Because of the semantics of what we store in the set, we had not anticipated lengths beyond about 10. (Almost all rows have length <= 6; the largest observed was 7032. The total number of rows was under 4000, across 3 machines.)

      The bad behavior we saw was that all machines went to 100% CPU on all cores, and clients were timing out. Our immediate solution in production was to add more machines (we went from 3 machines to 6). The stack included partitions.AtomicBTreePartition.addAllWithSizeDelta … ComplexColumnData.dataSize.
      The AtomicBTreePartition code uses a compare-and-swap (CAS) approach, but the time between successive compares depends on the length of the set, since dataSize walks every cell of the collection. When the set is long and updates are concurrent, each loop iteration is unlikely to make forward progress, so threads can spin in the retry loop for extended periods.
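      The pattern can be illustrated outside Cassandra. The following is a minimal sketch (not Cassandra code; class and method names are invented for illustration) of a CAS retry loop in which an O(n) size computation runs between the read and the compareAndSet, so every failed attempt repeats the full walk:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the contention pattern: each writer copies the collection,
// recomputes a size over all elements, then attempts a compareAndSet.
// Work per attempt grows with collection length, and a failed CAS
// repeats all of that work.
public class CasContentionSketch {
    static final AtomicReference<List<String>> ref =
            new AtomicReference<>(new ArrayList<>());
    static final AtomicLong casFailures = new AtomicLong();

    // Analogous to an O(n) dataSize: walks every element.
    static long dataSize(List<String> cells) {
        long size = 0;
        for (String c : cells) size += c.length();
        return size;
    }

    static void addElement(String value) {
        while (true) {
            List<String> current = ref.get();
            List<String> updated = new ArrayList<>(current);
            updated.add(value);
            long size = dataSize(updated); // O(n) work between read and CAS
            if (ref.compareAndSet(current, updated)) return;
            casFailures.incrementAndGet(); // another writer won; redo O(n) work
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] writers = new Thread[4];
        for (int t = 0; t < writers.length; t++) {
            final int id = t;
            writers[t] = new Thread(() -> {
                for (int i = 0; i < 500; i++) addElement("v" + id + "-" + i);
            });
            writers[t].start();
        }
        for (Thread w : writers) w.join();
        // Every insert eventually lands, but retries multiply the O(n) walks.
        System.out.println("elements=" + ref.get().size());
    }
}
```

As the list grows, the window between `ref.get()` and `compareAndSet` widens, which is exactly why contention worsens with set length rather than staying constant.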

      Here is one example call stack:

      "SharedPool-Worker-40" #167 daemon prio=10 os_prio=0 tid=0x00007f9bb4032800 nid=0x2ee5 runnable [0x00007f9b067f4000]
      java.lang.Thread.State: RUNNABLE
      at org.apache.cassandra.db.rows.ComplexColumnData.dataSize(ComplexColumnData.java:114)
      at org.apache.cassandra.db.rows.BTreeRow.dataSize(BTreeRow.java:373)
      at org.apache.cassandra.db.partitions.AtomicBTreePartition$RowUpdater.apply(AtomicBTreePartition.java:292)
      at org.apache.cassandra.db.partitions.AtomicBTreePartition$RowUpdater.apply(AtomicBTreePartition.java:235)
      at org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:159)
      at org.apache.cassandra.utils.btree.TreeBuilder.update(TreeBuilder.java:73)
      at org.apache.cassandra.utils.btree.BTree.update(BTree.java:181)
      at org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDelta(AtomicBTreePartition.java:155)
      at org.apache.cassandra.db.Memtable.put(Memtable.java:254)
      at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1204)
      at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:573)
      at org.apache.cassandra.db.Keyspace.applyFuture(Keyspace.java:384)
      at org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:205)
      at org.apache.cassandra.hints.Hint.applyFuture(Hint.java:99)
      at org.apache.cassandra.hints.HintVerbHandler.doVerb(HintVerbHandler.java:95)
      at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164)
      at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:136)
      at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105)
      at java.lang.Thread.run(Thread.java:748)
      

      In a test program that reproduces the problem, we raise the number of concurrent users and lower the think time between queries. Updates to elements of short sets complete without errors; with long sets, clients time out with errors, there are periods where all cores sit at 99.x% CPU, and jstack shows the time going to ComplexColumnData.dataSize.
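      The concurrent writes in such a repro are set-append updates of roughly this shape (the literal values are an assumption; the table matches the schema below):

```sql
-- Hypothetical set-append issued concurrently by many clients;
-- each one rebuilds and re-sizes the whole collection inside the
-- memtable's CAS loop.
UPDATE x.x SET y = y + {'some-element'} WHERE x = 1;
```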

      Here is the schema. Our long-term application solution was to make the set elements part of the primary key and avoid set<text> altogether, guaranteeing the code never goes through ComplexColumnData.dataSize.

      CREATE TABLE x.x (
          x int PRIMARY KEY,
          y set<text>
      ) ...
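      The workaround schema can be sketched as follows, with a hypothetical column name for the former set elements; each element becomes its own row under a clustering key, so collection-cell bookkeeping never runs:

```sql
-- Hypothetical workaround sketch: set elements become a clustering
-- column, one row per element, no set<text> collection involved.
CREATE TABLE x.x2 (
    x int,
    y_elem text,
    PRIMARY KEY (x, y_elem)
);
```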
      

              People

              • Assignee:
                Unassigned
              • Reporter:
                Eric Jacobsen
              • Votes:
                0
              • Watchers:
                4
