Details
Bug
Status: Resolved
Normal
Resolution: Fixed
None
Correctness
Normal
Normal
Unit Test
All
None
Description
This can be reproduced by running LongBufferPoolTest; it fails in about 1 out of 5 runs.
java.lang.NullPointerException
	at org.apache.cassandra.utils.memory.BufferPool$Chunk.access$1300(BufferPool.java:836)
	at org.apache.cassandra.utils.memory.BufferPool$LocalPool.lambda$remove$1(BufferPool.java:716)
	at org.apache.cassandra.utils.memory.BufferPool$MicroQueueOfChunks.removeIf(BufferPool.java:460)
	at org.apache.cassandra.utils.memory.BufferPool$MicroQueueOfChunks.access$1500(BufferPool.java:304)
	at org.apache.cassandra.utils.memory.BufferPool$LocalPool.remove(BufferPool.java:716)
	at org.apache.cassandra.utils.memory.BufferPool$LocalPool.put(BufferPool.java:590)
	at org.apache.cassandra.utils.memory.BufferPool$LocalPool.recycle(BufferPool.java:709)
	at org.apache.cassandra.utils.memory.BufferPool$Chunk.recycle(BufferPool.java:909)
	at org.apache.cassandra.utils.memory.BufferPool$Chunk.tryRecycle(BufferPool.java:903)
	at org.apache.cassandra.utils.memory.BufferPool$Chunk.release(BufferPool.java:896)
	at org.apache.cassandra.utils.memory.BufferPool$MicroQueueOfChunks.removeIf(BufferPool.java:465)
	at org.apache.cassandra.utils.memory.BufferPool$MicroQueueOfChunks.access$1500(BufferPool.java:304)
	at org.apache.cassandra.utils.memory.BufferPool$LocalPool.addChunk(BufferPool.java:736)
	at org.apache.cassandra.utils.memory.BufferPool$LocalPool.addChunkFromParent(BufferPool.java:725)
	at org.apache.cassandra.utils.memory.BufferPool$LocalPool.tryGetInternal(BufferPool.java:691)
	at org.apache.cassandra.utils.memory.BufferPool$LocalPool.tryGet(BufferPool.java:679)
	at org.apache.cassandra.utils.memory.BufferPool$LocalPool.access$000(BufferPool.java:518)
	at org.apache.cassandra.utils.memory.BufferPool.tryGet(BufferPool.java:120)
	at org.apache.cassandra.utils.memory.LongBufferPoolTest$2.testOne(LongBufferPoolTest.java:497)
	at org.apache.cassandra.utils.memory.LongBufferPoolTest$TestUntil.call(LongBufferPoolTest.java:558)
	at org.apache.cassandra.utils.memory.LongBufferPoolTest$TestUntil.call(LongBufferPoolTest.java:538)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)
The cause is that:
- When a normal chunk is evicted from a full MicroQueueOfChunks, the local pool tries to remove the corresponding tiny chunks via MicroQueueOfChunks#removeIf.
- If a matching tiny chunk is found, tinyChunk.release() is called immediately, before the resulting null entry has been moved to the back of the queue.
- Because other threads may be releasing buffers concurrently, tinyChunk.release() may cause its parent normal chunk (the same chunk evicted in the first step) to be recycled and removed from the local pool, and LocalPool#remove() then asks the tiny pool to remove the corresponding tiny chunks again.
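- This nested MicroQueueOfChunks#removeIf runs while the outer removeIf is still in progress, so the queue still has a null entry ahead of live chunks. That violates MicroQueueOfChunks's assumption that null chunks are only at the back of the queue, and the nested call throws an NPE when its predicate dereferences the null chunk.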
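The re-entrant part of this failure can be reproduced in isolation with a small, hypothetical sketch. It is not Cassandra's MicroQueueOfChunks: `BuggyMicroQueue`, `Chunk.owner`, and the `onRemove` callback are illustrative assumptions, with `onRemove` standing in for Chunk#release() re-entering removeIf via LocalPool#remove(). The concurrent release that makes the parent chunk recyclable is not modelled; the callback simply re-enters directly. Because the callback runs before compaction, the nested scan reaches the null hole and throws an NPE.

```java
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical, simplified stand-in for MicroQueueOfChunks (not Cassandra's code).
// Invariant: live entries occupy slots[0..size-1]; nulls only at the back.
public class BuggyMicroQueue {
    static final class Chunk {
        final String owner; // hypothetical: identifies the parent (normal) chunk
        Chunk(String owner) { this.owner = owner; }
    }

    final Chunk[] slots = new Chunk[3];
    int size;

    void add(Chunk c) {
        if (size == slots.length) throw new IllegalStateException("queue full");
        slots[size++] = c;
    }

    // Buggy ordering: the callback (standing in for Chunk#release()) runs while
    // a null hole may still sit in front of live entries.
    void removeIf(Predicate<Chunk> matches, Consumer<Chunk> onRemove) {
        int n = size;
        for (int i = 0; i < n; i++) {
            Chunk c = slots[i];          // trusts the invariant: no null check
            if (matches.test(c)) {
                slots[i] = null;         // leaves a hole at index i
                size--;
                onRemove.accept(c);      // may re-enter removeIf before compact()
            }
        }
        compact();
    }

    // Moves live entries to the front and nulls to the back.
    private void compact() {
        int j = 0;
        for (int i = 0; i < slots.length; i++) {
            if (slots[i] != null) slots[j++] = slots[i];
        }
        size = j;
        while (j < slots.length) slots[j++] = null;
    }

    public static void main(String[] args) {
        BuggyMicroQueue q = new BuggyMicroQueue();
        q.add(new Chunk("evicted"));
        q.add(new Chunk("other"));
        q.add(new Chunk("other"));
        try {
            // Releasing the matched chunk re-enters removeIf, as LocalPool#remove() does.
            q.removeIf(c -> c.owner.equals("evicted"),
                       c -> q.removeIf(x -> x.owner.equals("evicted"), y -> {}));
            System.out.println("no exception");
        } catch (NullPointerException e) {
            System.out.println("NPE from nested removeIf");
        }
    }
}
```

The nested call sees `size == 2` but `slots[0] == null`, so its predicate receives a null chunk, mirroring `lambda$remove$1` in the stack trace above.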
The fix is to move the null chunks to the back of the queue before releasing any of the removed chunks.
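The reordering can be sketched with a hypothetical three-slot queue (not Cassandra's actual MicroQueueOfChunks): matching entries are unlinked and collected first, the queue is compacted so nulls are back at the tail, and only then is the release callback invoked. `FixedMicroQueue`, `Chunk.owner`, and `onRemove` are illustrative assumptions; `onRemove` stands in for Chunk#release().

```java
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical, simplified stand-in for MicroQueueOfChunks (not Cassandra's code).
// Invariant: live entries occupy slots[0..size-1]; nulls only at the back.
public class FixedMicroQueue {
    static final class Chunk {
        final String owner; // hypothetical: identifies the parent (normal) chunk
        Chunk(String owner) { this.owner = owner; }
    }

    final Chunk[] slots = new Chunk[3];
    int size;

    void add(Chunk c) {
        if (size == slots.length) throw new IllegalStateException("queue full");
        slots[size++] = c;
    }

    // Fixed ordering: unlink matches, restore the invariant, then release.
    void removeIf(Predicate<Chunk> matches, Consumer<Chunk> onRemove) {
        Chunk[] toRelease = new Chunk[slots.length];
        int r = 0;
        int n = size;
        for (int i = 0; i < n; i++) {
            Chunk c = slots[i];
            if (matches.test(c)) {
                slots[i] = null;
                toRelease[r++] = c;        // defer the release
            }
        }
        compact();                         // nulls are back at the tail, size is correct
        for (int i = 0; i < r; i++) {
            onRemove.accept(toRelease[i]); // safe even if this re-enters removeIf
        }
    }

    // Moves live entries to the front and nulls to the back.
    private void compact() {
        int j = 0;
        for (int i = 0; i < slots.length; i++) {
            if (slots[i] != null) slots[j++] = slots[i];
        }
        size = j;
        while (j < slots.length) slots[j++] = null;
    }

    public static void main(String[] args) {
        FixedMicroQueue q = new FixedMicroQueue();
        q.add(new Chunk("evicted"));
        q.add(new Chunk("other"));
        q.add(new Chunk("other"));
        q.removeIf(c -> c.owner.equals("evicted"),
                   c -> q.removeIf(x -> x.owner.equals("evicted"), y -> {}));
        System.out.println("size after removal: " + q.size);
    }
}
```

Because compaction completes before any callback runs, a re-entrant removeIf always observes a queue that satisfies the invariant; the cost is one small local array per call.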