Details
Description
Recently, our Cassandra 4.0.6 cluster suffered an outage caused by a surge of expensive traffic from the application side. The surge consisted of a large volume of costly read queries, each of which took a long time to process on the server. The client had timeout settings, and a timed-out request could trigger new requests. Because the server nodes were overloaded, many of them had hundreds of thousands of tasks queued in the Native-Transport-Request pending queue. I expected that once the application stopped sending requests, the nodes would quickly return to normal: most requests in the queue were more than half an hour old, so they should have timed out immediately and the queue should have emptied. Instead, it took an hour to clear the native transport pending queue, even with native transport disabled.

Upon examining the code, I noticed that for read/write requests, queryStartNanoTime, which determines whether a request has timed out, is only set when the task starts processing. This means the time a request spends pending does not count toward its timeout, no matter how long that is. I believe this is incorrect. The timer should start when the Cassandra server receives the request or when it enqueues the task, not when the task begins processing. That way, an overloaded node with many pending tasks can quickly discard timed-out requests and recover from an outage once new requests stop.
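To illustrate the difference between the two timer start points, here is a minimal, self-contained sketch. It is not Cassandra code: the class EnqueueTimeDeadline, the Request record, and the methods isExpired, drain, and sampleQueue are all hypothetical names made up for this example, and timestamps are passed in explicitly so the behavior is deterministic. It only shows the idea that a deadline measured from enqueue time lets stale tasks be dropped without being executed, while a deadline measured from processing start never does.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.TimeUnit;

public class EnqueueTimeDeadline {
    /** Hypothetical pending request; only the enqueue timestamp matters here. */
    static final class Request {
        final long enqueuedAtNanos;
        Request(long enqueuedAtNanos) { this.enqueuedAtNanos = enqueuedAtNanos; }
    }

    /** True if at least timeoutNanos have elapsed between startNanos and nowNanos. */
    static boolean isExpired(long startNanos, long nowNanos, long timeoutNanos) {
        return nowNanos - startNanos >= timeoutNanos;
    }

    /**
     * Drains the queue at time nowNanos and returns how many requests were
     * dropped as expired instead of being executed.
     * startAtEnqueue = true  : timer starts when the request was enqueued.
     * startAtEnqueue = false : timer starts when processing begins, which is
     *                          the behavior described in this ticket.
     */
    static int drain(Queue<Request> pending, long nowNanos, long timeoutNanos,
                     boolean startAtEnqueue) {
        int dropped = 0;
        Request r;
        while ((r = pending.poll()) != null) {
            long start = startAtEnqueue ? r.enqueuedAtNanos : nowNanos;
            if (isExpired(start, nowNanos, timeoutNanos)) {
                dropped++;   // cheap: discard without running the query
            }
            // else: the (expensive) query would be executed here
        }
        return dropped;
    }

    /** Three sample requests: 40 minutes old, 30 minutes old, and 1 second old. */
    static Queue<Request> sampleQueue(long nowNanos) {
        Queue<Request> q = new ArrayDeque<>();
        q.add(new Request(nowNanos - TimeUnit.MINUTES.toNanos(40)));
        q.add(new Request(nowNanos - TimeUnit.MINUTES.toNanos(30)));
        q.add(new Request(nowNanos - TimeUnit.SECONDS.toNanos(1)));
        return q;
    }

    public static void main(String[] args) {
        long now = TimeUnit.MINUTES.toNanos(60);     // arbitrary "current" time
        long timeout = TimeUnit.SECONDS.toNanos(5);  // assumed request timeout
        System.out.println("dropped, timer from enqueue: "
                + drain(sampleQueue(now), now, timeout, true));
        System.out.println("dropped, timer from processing start: "
                + drain(sampleQueue(now), now, timeout, false));
    }
}
```

With these sample values, starting the timer at enqueue drops the two stale requests, while starting it at processing drops none, so every queued request would still be executed.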
Attachments
Issue Links
- is superseded by CASSANDRA-19534 Unbounded queues in native transport requests lead to node instability (Resolved)
- links to