Details
- Bug
- Status: Resolved
- Normal
- Resolution: Fixed
- None
- Degradation - Other Exception
- Normal
- Low Hanging Fruit
- User Report
- All
Description
On one of our clusters, we noticed rare but periodic ArrayIndexOutOfBoundsExceptions:
message="Uncaught exception on thread Thread[ReadStage-3,5,main]" exception="java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException
    at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2579)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
    at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
    at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ArrayIndexOutOfBoundsException"
The error was in a Runnable, so the stacktrace didn't directly indicate where the error was coming from. We enabled JFR to log the underlying exception that was thrown:
message="Uncaught exception on thread Thread[ReadStage-2,5,main]" exception="java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 0
    at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2579)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
    at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
    at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 0
    at java.base/java.util.ArrayList.add(ArrayList.java:487)
    at java.base/java.util.ArrayList.add(ArrayList.java:499)
    at org.apache.cassandra.service.ClientWarn$State.add(ClientWarn.java:84)
    at org.apache.cassandra.service.ClientWarn$State.access$000(ClientWarn.java:77)
    at org.apache.cassandra.service.ClientWarn.warn(ClientWarn.java:51)
    at org.apache.cassandra.db.ReadCommand$1MetricRecording.onClose(ReadCommand.java:596)
    at org.apache.cassandra.db.transform.BasePartitions.runOnClose(BasePartitions.java:70)
    at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:95)
    at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:2260)
    at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2575)
    ... 6 more"
An AIOBE on ArrayList.add(E) should only be possible when multiple threads attempt to call the method at the same time.
This was seen while executing a SELECT WHERE IN query with multiple partition keys. This exception could happen when multiple local reads are dispatched by the coordinator in org.apache.cassandra.service.reads.AbstractReadExecutor#makeRequests. In this case, multiple local reads exceed the tombstone warning threshold, so multiple tombstone warnings are added to the same ClientWarn.State reference. Currently, org.apache.cassandra.service.ClientWarn.State#warnings is an ArrayList, which isn't safe for concurrent modification, causing the AIOBE to be thrown.
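The race described above can be reproduced outside Cassandra with a small stand-alone program. This is only a sketch: the class name SharedWarningListRace and the helper failedAdds are hypothetical and not part of the Cassandra codebase, and the thread and iteration counts are arbitrary. Because it is a race, a given run may throw ArrayIndexOutOfBoundsException, may silently lose elements, or may happen to complete cleanly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-alone demo; not Cassandra code.
class SharedWarningListRace
{
    // Starts `threads` threads that each append `perThread` warnings to the
    // same list, and returns how many add() calls threw a RuntimeException.
    static int failedAdds(List<String> warnings, int threads, int perThread)
    {
        CountDownLatch start = new CountDownLatch(1);
        AtomicInteger failures = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++)
        {
            workers[t] = new Thread(() -> {
                try
                {
                    start.await(); // release all workers together to maximize contention
                }
                catch (InterruptedException e)
                {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < perThread; i++)
                {
                    try
                    {
                        warnings.add("tombstone warning");
                    }
                    catch (RuntimeException e)
                    {
                        // With an unsynchronized ArrayList this can be an AIOBE.
                        failures.incrementAndGet();
                    }
                }
            });
            workers[t].start();
        }
        start.countDown();
        for (Thread worker : workers)
        {
            try
            {
                worker.join();
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return failures.get();
    }

    public static void main(String[] args)
    {
        List<String> unsafe = new ArrayList<>();
        int failures = failedAdds(unsafe, 8, 100_000);
        // Results vary from run to run; a thread-safe list would report 0 and 800000.
        System.out.println("failed adds: " + failures + ", final size: " + unsafe.size());
    }
}
```

Passing a thread-safe list to the same helper should report no failed adds and the full element count, which is the behavior the fix relies on.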
I have a patch available for this, and I'm preparing it now. The patch is simple - it just changes org.apache.cassandra.service.ClientWarn.State#warnings to a thread-safe CopyOnWriteArrayList. I also have a jvm-dtest that demonstrates the issue but doesn't need to be merged - it shows how a SELECT WHERE IN query with local reads that add client warnings can add to the same ClientWarn.State from different threads. I'll push that in a separate branch just for demonstration purposes.
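For illustration, the shape of the change can be sketched as follows. WarningState is a hypothetical, simplified stand-in for ClientWarn.State, and addFromThreads is a made-up helper for exercising it; the actual patch is in the fix branch and may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical, simplified stand-in for ClientWarn.State; not the actual patch.
class WarningState
{
    // CopyOnWriteArrayList allows concurrent add() calls without external locking.
    private final List<String> warnings = new CopyOnWriteArrayList<>();

    void add(String warning)
    {
        warnings.add(warning);
    }

    // Returns a copy so callers cannot mutate the shared list.
    List<String> snapshot()
    {
        return new ArrayList<>(warnings);
    }

    // Adds `perThread` warnings from each of `threads` threads to `state`
    // and returns the number of warnings recorded afterwards.
    static int addFromThreads(WarningState state, int threads, int perThread)
    {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++)
        {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++)
                    state.add("tombstone warning");
            });
            workers[t].start();
        }
        for (Thread worker : workers)
        {
            try
            {
                worker.join();
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return state.snapshot().size();
    }

    public static void main(String[] args)
    {
        int recorded = addFromThreads(new WarningState(), 4, 500);
        System.out.println("warnings recorded: " + recorded);
    }
}
```

CopyOnWriteArrayList copies its backing array on every write, so each add is linear in the current list size. That trade-off seems reasonable here on the assumption that a request accumulates only a small number of warnings.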
Demonstration branch: https://github.com/apache/cassandra/compare/trunk...aratno:cassandra:CASSANDRA-19427-aiobe-clientwarn-demo
Fix branch: https://github.com/apache/cassandra/compare/trunk...aratno:cassandra:CASSANDRA-19427-aiobe-clientwarn-fix (PR linked below)
This appears to have been an issue since at least 3.11, which was the earliest release I checked.