Uploaded image for project: 'Geode'
  1. Geode
  2. GEODE-5155

hang recovering transaction state for crashed server

Details

    Description

      A concourse job failed in DlockAndTxlockRegressionTest.testDLockProtectsAgainstTransactionConflict with two threads stuck in this state:

      [vm2] "Pooled Waiting Message Processor 2" tid=0x71
      [vm2]     java.lang.Thread.State: WAITING
      [vm2] 	at java.lang.Object.wait(Native Method)
      [vm2] 	-  waiting on org.apache.geode.internal.cache.TXCommitMessage@2105ce6
      [vm2] 	at java.lang.Object.wait(Object.java:502)
      [vm2] 	at org.apache.geode.internal.cache.TXFarSideCMTracker.waitToProcess(TXFarSideCMTracker.java:176)
      [vm2] 	at org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage.processTXOriginatorRecoveryMessage(TXOriginatorRecoveryProcessor.java:160)
      [vm2] 	at org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage$1.run(TXOriginatorRecoveryProcessor.java:144)
      [vm2] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      [vm2] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      [vm2] 	at org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1121)
      [vm2] 	at org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109)
      [vm2] 	at org.apache.geode.distributed.internal.ClusterDistributionManager$6$1.run(ClusterDistributionManager.java:865)
      [vm2] 	at java.lang.Thread.run(Thread.java:748)
      

      I modified the test to tighten up its forcedDisconnect and performOps methods to get transaction recovery to happen more reliably.

        public void forceDisconnect() throws Exception {
          Cache existingCache = basicGetCache();
          synchronized(commitLock) {
            committing = false;
            while (!committing) {
              commitLock.wait();
            }
          }
          if (existingCache != null && !existingCache.isClosed()) {
            DistributedTestUtils.crashDistributedSystem(getCache().getDistributedSystem());
          }
        }
      
      
        public void performOps() {
          Cache cache = getCache();
          Region region = cache.getRegion("TestRegion");
          DistributedLockService dlockService = DistributedLockService.getServiceNamed("Bulldog");
          Random random = new Random();
      
          while (!cache.isClosed()) {
            boolean locked = false;
            try {
              locked = dlockService.lock("testDLock", 500, 60_000);
              if (!locked) {
                // this could happen if we're starved out for 30sec by other VMs
                continue;
              }
      
              cache.getCacheTransactionManager().begin();
      
              region.put("TestKey", "TestValue" + random.nextInt(100000));
      
              TXManagerImpl mgr = (TXManagerImpl) getCache().getCacheTransactionManager();
              TXStateProxyImpl txProxy = (TXStateProxyImpl) mgr.getTXState();
              TXState txState = (TXState) txProxy.getRealDeal(null, null);
              txState.setBeforeSend(() -> {
                synchronized(commitLock) {
                  committing = true;
                  commitLock.notifyAll();
                }});
      
              try {
                cache.getCacheTransactionManager().commit();
              } catch (CommitConflictException e) {
                throw new RuntimeException("dlock failed to prevent a transaction conflict", e);
              }
      
              int txCount = getBlackboard().getMailbox(TRANSACTION_COUNT);
              getBlackboard().setMailbox(TRANSACTION_COUNT, txCount + 1);
      
            } catch (CancelException | IllegalStateException e) {
              // okay to ignore
            } finally {
              if (locked) {
                try {
                  dlockService.unlock("testDLock");
                } catch (CancelException | IllegalStateException e) {
                  // shutting down
                }
              }
            }
          }
        }
      

      The problem is that the membership listener in TXCommitMessage is removing itself from the transaction map in TXFarSideCMTracker without setting any state that the recovery message can check. The recovery method is waiting like this:

          synchronized (this.txInProgress) {
            mess = (TXCommitMessage) this.txInProgress.get(lk);
          }
          if (mess != null) {
            synchronized (mess) {
              // tx in progress, we must wait until its done
              while (!mess.wasProcessed()) {
                try {
                  mess.wait();
                } catch (InterruptedException ie) {
                  Thread.currentThread().interrupt();
                  logger.error(LocalizedMessage.create(
                      LocalizedStrings.TxFarSideTracker_WAITING_TO_COMPLETE_ON_MESSAGE_0_CAUGHT_AN_INTERRUPTED_EXCEPTION,
                      mess), ie);
                  break;
                }
              }
            }
      

      We could probably change this method to make sure that the message is still in the map instead of only checking wasProcessed().

      Attachments

        Issue Links

          Activity

            Commit 1a9ee1f198877d6d715c9563f2d68cfb318ba88b in geode's branch refs/heads/feature/GEODE-5155 from bschuchardt
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=1a9ee1f ]

            GEODE-5155 hang recovering transaction state for crashed server

            oops - missed a change

            jira-bot ASF subversion and git services added a comment - Commit 1a9ee1f198877d6d715c9563f2d68cfb318ba88b in geode's branch refs/heads/feature/ GEODE-5155 from bschuchardt [ https://gitbox.apache.org/repos/asf?p=geode.git;h=1a9ee1f ] GEODE-5155 hang recovering transaction state for crashed server oops - missed a change

            Commit e1688b6ab99fb4abca5be8e260eb0380eba33695 in geode's branch refs/heads/feature/GEODE-5155 from bschuchardt
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=e1688b6 ]

            GEODE-5155 hang recovering transaction state for crashed server

            renaming variables to make the code more readable

            jira-bot ASF subversion and git services added a comment - Commit e1688b6ab99fb4abca5be8e260eb0380eba33695 in geode's branch refs/heads/feature/ GEODE-5155 from bschuchardt [ https://gitbox.apache.org/repos/asf?p=geode.git;h=e1688b6 ] GEODE-5155 hang recovering transaction state for crashed server renaming variables to make the code more readable

            Commit 2a02923a2fe3ead102ff79e76c47935e84f76859 in geode's branch refs/heads/develop from bschuchardt
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=2a02923 ]

            GEODE-5155 hang recovering transaction state for crashed server

            The fix is to check to see if the message was removed due to a
            memberDeparted event.

            When that happens a departureNoticed flag was being set in TXCommitMessage
            but the wait loop in the transaction tracker wasn't checking this flag.

            jira-bot ASF subversion and git services added a comment - Commit 2a02923a2fe3ead102ff79e76c47935e84f76859 in geode's branch refs/heads/develop from bschuchardt [ https://gitbox.apache.org/repos/asf?p=geode.git;h=2a02923 ] GEODE-5155 hang recovering transaction state for crashed server The fix is to check to see if the message was removed due to a memberDeparted event. When that happens a departureNoticed flag was being set in TXCommitMessage but the wait loop in the transaction tracker wasn't checking this flag.

            Commit 2c1b8a4edd99c3b5d25697a08b917de3310c31ae in geode's branch refs/heads/develop from bschuchardt
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=2c1b8a4 ]

            GEODE-5155 hang recovering transaction state for crashed server

            fixing an NPE caused by the test creating a new cache in between
            starting a transaction and installing a test hook.

            jira-bot ASF subversion and git services added a comment - Commit 2c1b8a4edd99c3b5d25697a08b917de3310c31ae in geode's branch refs/heads/develop from bschuchardt [ https://gitbox.apache.org/repos/asf?p=geode.git;h=2c1b8a4 ] GEODE-5155 hang recovering transaction state for crashed server fixing an NPE caused by the test creating a new cache in between starting a transaction and installing a test hook.

            People

              bschuchardt Bruce J Schuchardt
              bschuchardt Bruce J Schuchardt
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 40m
                  40m