Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
1.7.0
Description
A Concourse job failed in DlockAndTxlockRegressionTest.testDLockProtectsAgainstTransactionConflict with two threads stuck in this state:
[vm2] "Pooled Waiting Message Processor 2" tid=0x71
[vm2] java.lang.Thread.State: WAITING
[vm2] at java.lang.Object.wait(Native Method)
[vm2] - waiting on org.apache.geode.internal.cache.TXCommitMessage@2105ce6
[vm2] at java.lang.Object.wait(Object.java:502)
[vm2] at org.apache.geode.internal.cache.TXFarSideCMTracker.waitToProcess(TXFarSideCMTracker.java:176)
[vm2] at org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage.processTXOriginatorRecoveryMessage(TXOriginatorRecoveryProcessor.java:160)
[vm2] at org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage$1.run(TXOriginatorRecoveryProcessor.java:144)
[vm2] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[vm2] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[vm2] at org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1121)
[vm2] at org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109)
[vm2] at org.apache.geode.distributed.internal.ClusterDistributionManager$6$1.run(ClusterDistributionManager.java:865)
[vm2] at java.lang.Thread.run(Thread.java:748)
I modified the test to tighten up its forceDisconnect and performOps methods so that transaction recovery happens more reliably.
public void forceDisconnect() throws Exception {
  Cache existingCache = basicGetCache();
  synchronized (commitLock) {
    committing = false;
    while (!committing) {
      commitLock.wait();
    }
  }
  if (existingCache != null && !existingCache.isClosed()) {
    DistributedTestUtils.crashDistributedSystem(getCache().getDistributedSystem());
  }
}

public void performOps() {
  Cache cache = getCache();
  Region region = cache.getRegion("TestRegion");
  DistributedLockService dlockService = DistributedLockService.getServiceNamed("Bulldog");
  Random random = new Random();

  while (!cache.isClosed()) {
    boolean locked = false;
    try {
      locked = dlockService.lock("testDLock", 500, 60_000);
      if (!locked) {
        // this could happen if we're starved out for 30sec by other VMs
        continue;
      }

      cache.getCacheTransactionManager().begin();
      region.put("TestKey", "TestValue" + random.nextInt(100000));

      TXManagerImpl mgr = (TXManagerImpl) getCache().getCacheTransactionManager();
      TXStateProxyImpl txProxy = (TXStateProxyImpl) mgr.getTXState();
      TXState txState = (TXState) txProxy.getRealDeal(null, null);
      txState.setBeforeSend(() -> {
        synchronized (commitLock) {
          committing = true;
          commitLock.notifyAll();
        }
      });

      try {
        cache.getCacheTransactionManager().commit();
      } catch (CommitConflictException e) {
        throw new RuntimeException("dlock failed to prevent a transaction conflict", e);
      }

      int txCount = getBlackboard().getMailbox(TRANSACTION_COUNT);
      getBlackboard().setMailbox(TRANSACTION_COUNT, txCount + 1);
    } catch (CancelException | IllegalStateException e) {
      // okay to ignore
    } finally {
      if (locked) {
        try {
          dlockService.unlock("testDLock");
        } catch (CancelException | IllegalStateException e) {
          // shutting down
        }
      }
    }
  }
}
The problem is that the membership listener in TXCommitMessage is removing itself from the transaction map in TXFarSideCMTracker without setting any state that the recovery message can check. The recovery method is waiting like this:
synchronized (this.txInProgress) {
  mess = (TXCommitMessage) this.txInProgress.get(lk);
}
if (mess != null) {
  synchronized (mess) {
    // tx in progress, we must wait until its done
    while (!mess.wasProcessed()) {
      try {
        mess.wait();
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        logger.error(LocalizedMessage.create(
            LocalizedStrings.TxFarSideTracker_WAITING_TO_COMPLETE_ON_MESSAGE_0_CAUGHT_AN_INTERRUPTED_EXCEPTION,
            mess), ie);
        break;
      }
    }
  }
}
We could probably change this method to make sure that the message is still in the map instead of only checking wasProcessed().
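The idea can be illustrated with a small self-contained model. This is only a sketch of the proposed check, not the actual Geode change: Message and Tracker are hypothetical stand-ins for TXCommitMessage and TXFarSideCMTracker, the String lock id replaces TXLockId, and the notifyAll() on removal and the bounded wait(100) are assumptions added so the waiter gets a chance to re-check the map.

```java
import java.util.HashMap;
import java.util.Map;

public class RecoveryWaitSketch {

  /** Hypothetical stand-in for TXCommitMessage; only the "processed" flag is modeled. */
  static class Message {
    private boolean processed;

    synchronized boolean wasProcessed() {
      return processed;
    }

    synchronized void markProcessed() {
      processed = true;
      notifyAll();
    }
  }

  /** Hypothetical stand-in for TXFarSideCMTracker. */
  static class Tracker {
    private final Map<String, Message> txInProgress = new HashMap<>();

    void add(String lockId, Message m) {
      synchronized (txInProgress) {
        txInProgress.put(lockId, m);
      }
    }

    /** Models the membership listener dropping the message when the originator departs. */
    void removeOnDeparture(String lockId) {
      Message m;
      synchronized (txInProgress) {
        m = txInProgress.remove(lockId);
      }
      if (m != null) {
        synchronized (m) {
          m.notifyAll(); // assumption: wake waiters so they re-check the map
        }
      }
    }

    private boolean isStillTracked(String lockId, Message m) {
      synchronized (txInProgress) {
        return txInProgress.get(lockId) == m;
      }
    }

    /** Returns true if the message was processed, false if it was absent or removed. */
    boolean waitToProcess(String lockId) throws InterruptedException {
      Message m;
      synchronized (txInProgress) {
        m = txInProgress.get(lockId);
      }
      if (m == null) {
        return false;
      }
      synchronized (m) {
        // Stop waiting when the message is processed OR no longer in the map.
        while (!m.wasProcessed() && isStillTracked(lockId, m)) {
          m.wait(100); // bounded wait as a backstop against a missed notification
        }
        return m.wasProcessed();
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Tracker tracker = new Tracker();
    tracker.add("lk", new Message());

    // Simulate the originator crashing while the recovery thread is waiting.
    Thread departure = new Thread(() -> {
      try {
        Thread.sleep(200);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      tracker.removeOnDeparture("lk");
    });
    departure.start();

    System.out.println("processed=" + tracker.waitToProcess("lk"));
    departure.join();
  }
}
```

With only the wasProcessed() check, the waiter in main would never return once the message is removed unprocessed; the extra map check lets it give up.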
Commit 1a9ee1f198877d6d715c9563f2d68cfb318ba88b in geode's branch refs/heads/feature/GEODE-5155 from bschuchardt
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=1a9ee1f ]
GEODE-5155 hang recovering transaction state for crashed server
oops - missed a change