Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
Discovery Commons 1.0.20
-
None
Description
Invoking a ClusterSyncService.sync() should ensure that if a background thread is running, it gets terminated. In the case of a chain of sync services implemented by ClusterSyncServiceChain there is a race-condition which can leak a background thread running unintentionally.
The order of events is eg:
- setup: ClusterSyncServiceChain is configured with a chain of a OakBacklogClusterSyncService followed by a SyncTokenService
- ViewStateManagerImpl.handleNewViewNonDelayed can be called consecutively, as it is protected by a lock. Let's assume on first invocation it starts a sync process by invoking the ClusterSyncServiceChain.sync()
- the ClusterSyncServiceChain.sync() will first invoke the OakBacklogClusterSyncService.sync()
- let's assume there is a backlog (ie a recovery hasn't finished yet) with a particular instance in the cluster - so OakBacklogClusterSyncService.sync() will spawn a background thread that checks for that backlog every 2 seconds (handled by AbstractServiceWithBackgroundCheck.BackgroundCheckRunnable.run()).
- the OakBacklogClusterSyncService background check determines that the backlog is gone and that it can proceed. It will do so by continuing in the ClusterSyncServiceChain - ie it will invoke the SyncTokenService.sync() - that in turn invokes syncToken() - which invokes startBackgroundThread. At this point, let's assume, the check() returns false - ie there are sync tokens it needs to wait for. Hence it will start a new background Thread for the SyncTokenService.
- let's now assume, at some later point ViewStateManagerImpl.handleNewViewNonDelayed is again invoked. So it will again call ClusterSyncServiceChain.sync(). Now it is the responsibility of ClusterSyncServiceChain.sync() to cancel any currently running background thread - but the first element in the chain (the OakBacklogClusterSyncService) doesn't actually have any background thread anymore - it has already "handed over" to the SyncTokenService - so the OakBacklogClusterSyncService.cancelPreviousBackgroundCheck() will not have anything to cancel. Hence it goes ahead and follows the sync() chain process - which is first checking for backlog (that has cleared now, as we assumed), and then the sync token (which is not complete, as we assumed). So it will now create a second background thread.
The above resulting in 2 SyncTokenService background threads.
This can have a few consequences:
- a behaviour was noticed, whereby SyncTokenService.storeMySyncToken keeps being invoked with toggling values, eg first it stores 627, then 628, then again 627, then 628 etc. The following is an example of the generated data in DocumentNodeStore as a result ( excerpt from the "_id" : "4:/var/discovery/oak/syncTokens" document ) :
"<slingId>" : { "r179378d10da-0-4" : "\"627\"", "r179378d14ce-0-4" : "\"628\"", "r179378d18bc-0-4" : "\"627\"", "r179378d1ca1-0-4" : "\"628\"", "r179378d209c-0-4" : "\"627\"", "r179378d2477-0-4" : "\"628\"", "r179378d287f-0-4" : "\"627\"", "r179378d2c4b-0-4" : "\"628\"", "r179378d3068-0-4" : "\"627\"", "r179378d3421-0-4" : "\"628\"" },
This eventually should clear out - but it can lead to
- unnecessary lengthening the sync procedure as other cluster nodes will potentially have to wait longer
- confusion when reading such log files
- it can also result in conflict exceptions if the two threads want to update simultaneously:
04.05.2021 13:22:27.741 *ERROR* [Discovery-AsyncEventSender] org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService storeMySyncToken: got PersistenceException while storing my syncToken: org.apache.sling.api.resource.PersistenceException: Unable to commit changes to session. org.apache.sling.api.resource.PersistenceException: Unable to commit changes to session. at org.apache.sling.jcr.resource.internal.helper.jcr.JcrResourceProvider.commit(JcrResourceProvider.java:516) [org.apache.sling.jcr.resource:3.0.22] at org.apache.sling.resourceresolver.impl.providers.stateful.AuthenticatedResourceProvider.commit(AuthenticatedResourceProvider.java:215) [org.apache.sling.resourceresolver:1.7.2] at org.apache.sling.resourceresolver.impl.helper.ResourceResolverControl.commit(ResourceResolverControl.java:424) [org.apache.sling.resourceresolver:1.7.2] at org.apache.sling.resourceresolver.impl.ResourceResolverImpl.commit(ResourceResolverImpl.java:1000) [org.apache.sling.resourceresolver:1.7.2] at org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService.storeMySyncToken(SyncTokenService.java:174) [org.apache.sling.discovery.commons:1.0.20] at org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService.access$000(SyncTokenService.java:53) [org.apache.sling.discovery.commons:1.0.20] at org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService$1.check(SyncTokenService.java:137) [org.apache.sling.discovery.commons:1.0.20] at org.apache.sling.discovery.commons.providers.spi.base.AbstractServiceWithBackgroundCheck.startBackgroundCheck(AbstractServiceWithBackgroundCheck.java:199) [org.apache.sling.discovery.commons: 1.0.20] at org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService.syncToken(SyncTokenService.java:131) [org.apache.sling.discovery.commons:1.0.20] at org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService.sync(SyncTokenService.java:124) [org.apache.sling.discovery.commons:1.0.20] at org.apache.sling.discovery.commons.providers.spi.base.ClusterSyncServiceChain.chainedSync(ClusterSyncServiceChain.java:65) [org.apache.sling.discovery.commons:1.0.20] at org.apache.sling.discovery.commons.providers.spi.base.ClusterSyncServiceChain.access$000(ClusterSyncServiceChain.java:34) [org.apache.sling.discovery.commons:1.0.20] at org.apache.sling.discovery.commons.providers.spi.base.ClusterSyncServiceChain$1.run(ClusterSyncServiceChain.java:69) [org.apache.sling.discovery.commons:1.0.20] at org.apache.sling.discovery.commons.providers.spi.base.AbstractServiceWithBackgroundCheck.startBackgroundCheck(AbstractServiceWithBackgroundCheck.java:204) [org.apache.sling.discovery.commons: 1.0.20] at org.apache.sling.discovery.commons.providers.spi.base.OakBacklogClusterSyncService.waitWhileBacklog(OakBacklogClusterSyncService.java:142) [org.apache.sling.discovery.commons:1.0.20] at org.apache.sling.discovery.commons.providers.spi.base.OakBacklogClusterSyncService.sync(OakBacklogClusterSyncService.java:136) [org.apache.sling.discovery.commons:1.0.20] at org.apache.sling.discovery.commons.providers.spi.base.ClusterSyncServiceChain.chainedSync(ClusterSyncServiceChain.java:65) [org.apache.sling.discovery.commons:1.0.20] at org.apache.sling.discovery.commons.providers.spi.base.ClusterSyncServiceChain.sync(ClusterSyncServiceChain.java:54) [org.apache.sling.discovery.commons:1.0.20] at org.apache.sling.discovery.commons.providers.base.ViewStateManagerImpl$1.trigger(ViewStateManagerImpl.java:548) [org.apache.sling.discovery.commons:1.0.20] at org.apache.sling.discovery.commons.providers.base.AsyncEventSender.run(AsyncEventSender.java:118) [org.apache.sling.discovery.commons:1.0.20] at java.base/java.lang.Thread.run(Thread.java:834) Caused by: javax.jcr.InvalidItemStateException: OakState0001: Unresolved conflicts in /var/discovery/oak/syncTokens at org.apache.jackrabbit.oak.api.CommitFailedException.asRepositoryException(CommitFailedException.java:238) ...
- the 2 threads can also be seen in a thread-dump of course:
"SyncTokenService" #504 daemon prio=5 os_prio=0 cpu=258.15ms elapsed=47.01s tid=0x000000000f315800 nid=0x449c in Object.wait() [0x00007f74fd56c000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(java.base@11.0.6/Native Method) - waiting on <no object reference available> at org.apache.sling.discovery.commons.providers.spi.base.AbstractServiceWithBackgroundCheck$BackgroundCheckRunnable.run(AbstractServiceWithBackgroundCheck.java:98) - waiting to re-lock in wait() <0x00000007a8516420> (a java.lang.Object) at java.lang.Thread.run(java.base@11.0.6/Thread.java:834) "SyncTokenService" #530 daemon prio=5 os_prio=0 cpu=0.78ms elapsed=0.96s tid=0x000000000a638800 nid=0x4870 in Object.wait() [0x00007f74f0bdf000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(java.base@11.0.6/Native Method) - waiting on <no object reference available> at org.apache.sling.discovery.commons.providers.spi.base.AbstractServiceWithBackgroundCheck$BackgroundCheckRunnable.run(AbstractServiceWithBackgroundCheck.java:98) - waiting to re-lock in wait() <0x00000007bfe29640> (a java.lang.Object) at java.lang.Thread.run(java.base@11.0.6/Thread.java:834)