Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Description
A client can miss events from a subscription (either CQ or register interest) in the following scenario:
There are four servers in a cluster, with redundant copies set to 2 for client subscriptions. The client has its primary subscription endpoint on server 1, and the redundant copies are on servers 2 and 3. Server 2 is killed or lost due to a network partition, so we attempt to restore redundancy by copying the client queue from server 3 to server 4.
Two things happen when server 4 gets the client queue from server 3. First, we request the client's filter info, which represents the client's CQs and registered interest. Second, we perform the GII to get the image of the queue.
A race can occur when an event is distributed across the cluster while server 4 is initializing the client queue. If server 4 processes the event before it has retrieved the filter info, the event will not match the client's subscription filter, because the filter doesn't exist on server 4 yet. If server 3 then processes the same event after the GII has started, the event will not be part of the client queue image either. The event is therefore never added to the client queue on server 4 and is lost.
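The interleaving above can be replayed deterministically in a small sketch. This is not Geode code: ClientQueue, the string keys, and the "order" prefix filter are hypothetical stand-ins for the per-client queue, the events, and the CQ/interest filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class LostEventRace {
    /** Hypothetical stand-in for one server's copy of a client's subscription queue. */
    static final class ClientQueue {
        Predicate<String> filter;                 // null until filter info is retrieved
        final List<String> events = new ArrayList<>();

        void onEvent(String key) {
            // Without a filter nothing can match, so the event is dropped.
            if (filter != null && filter.test(key)) {
                events.add(key);
            }
        }
    }

    /** Replays the problematic ordering and returns server 4's queue contents. */
    static List<String> replay() {
        ClientQueue server3 = new ClientQueue();
        server3.filter = key -> key.startsWith("order");   // assumed client interest
        ClientQueue server4 = new ClientQueue();

        // 1. The event reaches server 4 before the filter info: dropped.
        server4.onEvent("order-1");
        // 2. Server 4 retrieves the filter info from server 3.
        server4.filter = server3.filter;
        // 3. GII starts: the queue image is captured from server 3.
        List<String> image = new ArrayList<>(server3.events);
        // 4. Server 3 processes the event only after the image was captured.
        server3.onEvent("order-1");
        // 5. Server 4 installs the image, which does not contain the event.
        server4.events.addAll(image);
        return server4.events;
    }

    public static void main(String[] args) {
        System.out.println("server 4 queue: " + replay());   // prints "server 4 queue: []"
    }
}
```

Server 3 ends up with the event in its queue, but server 4's copy is empty even though the event matches the client's filter.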
We have a special queue for handling events while a client is initializing, but it sits at too low a level (MessageDispatcher) to handle this scenario. One possible solution is to move this queue to a higher level (CacheClientNotifier or CacheClientProxy) so that an event is queued before we even attempt to get the filter info. Then, when initialization finishes, we drain the queue, check each event against the initialized client's filter, and send it along if it matches. A similar solution could be implemented on the GII provider side, but it might be a bit messier.
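A minimal sketch of the proposed higher-level queue, under stated assumptions: InitializingClientProxy and its methods are hypothetical names, not the actual CacheClientProxy API, and events are reduced to string keys. Skipping keys already present in the image is an assumed simplification of duplicate handling; a real implementation would presumably compare event IDs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.function.Predicate;

public class InitializingClientProxy {
    private final Queue<String> pendingEvents = new ArrayDeque<>(); // held during init
    private final List<String> clientQueue = new ArrayList<>();
    private Predicate<String> filter;
    private boolean initialized;

    /** Called for every distributed event, even before filter info exists. */
    public synchronized void onEvent(String key) {
        if (!initialized) {
            pendingEvents.add(key);   // queue first, decide on matching later
            return;
        }
        deliverIfMatching(key);
    }

    /** Called once the filter info and the GII queue image have both arrived. */
    public synchronized void finishInitialization(Predicate<String> retrievedFilter,
                                                  List<String> queueImage) {
        filter = retrievedFilter;
        clientQueue.addAll(queueImage);
        String key;
        while ((key = pendingEvents.poll()) != null) {
            deliverIfMatching(key);   // drain against the now-known filter
        }
        initialized = true;
    }

    private void deliverIfMatching(String key) {
        // Assumed dedup: skip events the image already contains.
        if (filter.test(key) && !clientQueue.contains(key)) {
            clientQueue.add(key);
        }
    }

    public synchronized List<String> queuedEvents() {
        return new ArrayList<>(clientQueue);
    }

    /** Events arrive before, during, and after initialization. */
    static List<String> demo() {
        InitializingClientProxy proxy = new InitializingClientProxy();
        proxy.onEvent("order-1");     // arrives before filter info: held, not dropped
        proxy.onEvent("stock-7");     // held too; rejected by the filter on drain
        proxy.finishInitialization(k -> k.startsWith("order"), Arrays.asList("order-0"));
        proxy.onEvent("order-2");     // normal path after initialization
        return proxy.queuedEvents();
    }

    public static void main(String[] args) {
        System.out.println(demo());   // prints "[order-0, order-1, order-2]"
    }
}
```

Because both methods are synchronized on the proxy, an event either lands in the pending queue before the drain or takes the normal path after it, so there is no window in which it is checked against a missing filter.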