  1. Apache Cassandra
  2. CASSANDRA-20141

Unresponsive node after ingesting large amounts of vectors

Details

    • Type: Bug
    • Status: Triage Needed
    • Priority: Normal
    • Resolution: Unresolved

    Description

      Background:

      We have a Cassandra 5.0.2 cluster running on Java 17. We have tried cluster sizes ranging from 3 to 23 nodes, all running in AWS on r7i.4xlarge instances.

      We have a table with an id column of type TEXT and another column of type VECTOR<FLOAT, 256>.

      On that table we also have an SAI index on the VECTOR column, created with the options { 'similarity_function': 'EUCLIDEAN' }.

      When:

      When we ingest a large number of embeddings (~200 million), we see every time that a node becomes unresponsive before all embeddings have been saved (after more than 20 million have been ingested), and that node is then unable to rejoin the cluster.

      If the index is dropped before we ingest the data, everything is persisted properly. However, once the index is added back (and created successfully), the same thing happens again as soon as we continue writing more embeddings to the cluster.

      What:

      We saw the following stacktrace in our logs:

      java.lang.NullPointerException: Cannot invoke "java.lang.Boolean.booleanValue()" because "res" is null
          at org.apache.cassandra.utils.memory.MemtableCleanerThread$Clean.apply(MemtableCleanerThread.java:97)
          at org.apache.cassandra.utils.concurrent.ListenerList$CallbackBiConsumerListener.run(ListenerList.java:244)
          at org.apache.cassandra.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:140)
          at org.apache.cassandra.utils.concurrent.ListenerList.safeExecute(ListenerList.java:166)
          at org.apache.cassandra.utils.concurrent.ListenerList.notifyListener(ListenerList.java:157)
          at org.apache.cassandra.utils.concurrent.ListenerList$CallbackBiConsumerListener.notifySelf(ListenerList.java:250)
          at org.apache.cassandra.utils.concurrent.ListenerList.lambda$notifyExclusive$0(ListenerList.java:124)
          at org.apache.cassandra.utils.concurrent.IntrusiveStack.forEach(IntrusiveStack.java:195)
          at org.apache.cassandra.utils.concurrent.ListenerList.notifyExclusive(ListenerList.java:124)
          at org.apache.cassandra.utils.concurrent.ListenerList.notify(ListenerList.java:96)
          at org.apache.cassandra.utils.concurrent.AsyncFuture.trySet(AsyncFuture.java:104)
          at org.apache.cassandra.utils.concurrent.AbstractFuture.tryFailure(AbstractFuture.java:148)
          at org.apache.cassandra.utils.concurrent.AsyncPromise.tryFailure(AsyncPromise.java:139)
          at org.apache.cassandra.db.memtable.AbstractAllocatorMemtable.lambda$flushLargestMemtable$0(AbstractAllocatorMemtable.java:306)
          at org.apache.cassandra.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:140)
          at org.apache.cassandra.utils.concurrent.ListenerList.safeExecute(ListenerList.java:166)
          at org.apache.cassandra.utils.concurrent.ListenerList.notifyListener(ListenerList.java:157)
          at org.apache.cassandra.utils.concurrent.ListenerList$RunnableWithExecutor.notifySelf(ListenerList.java:345)
          at org.apache.cassandra.utils.concurrent.ListenerList.lambda$notifyExclusive$0(ListenerList.java:124)
          at org.apache.cassandra.utils.concurrent.IntrusiveStack.forEach(IntrusiveStack.java:195)
          at org.apache.cassandra.utils.concurrent.ListenerList.notifyExclusive(ListenerList.java:124)
          at org.apache.cassandra.utils.concurrent.ListenerList.notify(ListenerList.java:96)
          at org.apache.cassandra.utils.concurrent.AsyncFuture.trySet(AsyncFuture.java:104)
          at org.apache.cassandra.utils.concurrent.AbstractFuture.tryFailure(AbstractFuture.java:148)
          at org.apache.cassandra.concurrent.FutureTask.tryFailure(FutureTask.java:87)
          at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:75)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.base/java.lang.Thread.run(Thread.java:840)
      

      This leads me to believe that the NPE above is thrown when the memtables are about to be cleaned (perhaps when they are flushed to disk as SSTables?).
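
      For illustration only: the trace shows a failure being propagated through tryFailure into a listener, and the exception message says that a Boolean named "res" was null when it was unboxed. The sketch below reproduces that general mechanism with the JDK's CompletableFuture. It is not Cassandra's code; the class and method names (NullUnboxingSketch, unsafeApply, safeApply) are hypothetical, and the assumption is that the callback receives a (result, error) pair in which the result is null whenever the future has failed.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.BiFunction;

// Hypothetical sketch, not Cassandra's implementation.
class NullUnboxingSketch {

    // Unboxes the result without checking whether the future failed.
    // When res is null, auto-unboxing calls res.booleanValue() and throws an NPE.
    static boolean unsafeApply(Boolean res, Throwable err) {
        return res;
    }

    // Defensive variant: treat a failure (or a missing result) as "nothing cleaned".
    static boolean safeApply(Boolean res, Throwable err) {
        if (err != null || res == null) {
            return false;
        }
        return res;
    }

    // Fails a future, runs the callback on it, and reports whether the
    // callback itself threw a NullPointerException.
    static boolean throwsNpeOnFailure(BiFunction<Boolean, Throwable, Boolean> callback) {
        CompletableFuture<Boolean> flush = new CompletableFuture<>();
        flush.completeExceptionally(new RuntimeException("simulated flush failure"));
        try {
            // handle() passes (null, exception) to the callback for a failed future.
            flush.handle(callback).join();
            return false;
        } catch (CompletionException e) {
            return e.getCause() instanceof NullPointerException;
        }
    }

    public static void main(String[] args) {
        System.out.println(throwsNpeOnFailure(NullUnboxingSketch::unsafeApply)); // true
        System.out.println(throwsNpeOnFailure(NullUnboxingSketch::safeApply));   // false
    }
}
```

      If this is what happens here, the NPE would be a secondary symptom: the original flush failure would be the exception that was passed to tryFailure, and it may appear elsewhere in the logs.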

          People

            Assignee: Unassigned
            Reporter: Robert Knutsson (robknu)