Uploaded image for project: 'Geode'
  1. Geode
  2. GEODE-4928

Using a distributed lock to wrap a transaction will not prevent a CommitConflictException if members fail

Details

    Description

      I wrote a function that obtained a dlock and then performed a transaction.  It always operates on the same dlock key and the same keys in my region.  That protects against getting a commit conflict exception BUT this sometimes fails if the JVM holding the lock crashes.  One of the functions appears to get the dlock okay but then its transaction fails when it goes to commit.

      Attachments

        Issue Links

          Activity

            Commit d8e30a0de6c8742dde53d9203cd122e9bc710849 in geode's branch refs/heads/feature/GEODE-4928 from bschuchardt
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=d8e30a0 ]

            GEODE-4928

            addressing review comment

            jira-bot ASF subversion and git services added a comment - Commit d8e30a0de6c8742dde53d9203cd122e9bc710849 in geode's branch refs/heads/feature/ GEODE-4928 from bschuchardt [ https://gitbox.apache.org/repos/asf?p=geode.git;h=d8e30a0 ] GEODE-4928 addressing review comment

            Commit f3b47a5a8ba65fbcf0915e94da3ef3683962c43a in geode's branch refs/heads/develop from bschuchardt
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=f3b47a5 ]

            GEODE-4928 DistributedLockService doesn't work as expected while the dlock grantor is initialized

            The lock service cleans up transaction locks in a background thread but
            cleans up regular dlocks in its membership listener. This allows you to
            get a dlock as soon as a member holding that lock leaves the distributed
            system, but its transaction locks stick around for a while.

            The fix for this is to wait for the background processing to complete before
            acquiring locks and checking for conflicts.

            jira-bot ASF subversion and git services added a comment - Commit f3b47a5a8ba65fbcf0915e94da3ef3683962c43a in geode's branch refs/heads/develop from bschuchardt [ https://gitbox.apache.org/repos/asf?p=geode.git;h=f3b47a5 ] GEODE-4928 DistributedLockService doesn't work as expected while the dlock grantor is initialized The lock service cleans up transaction locks in a background thread but cleans up regular dlocks in its membership listener. This allows you to get a dlock as soon as a member holding that lock leaves the distributed system, but its transaction locks stick around for a while. The fix for this is to wait for the background processing to complete before acquiring locks and checking for conflicts.

            this is fixed in develop but needs to be ported to the 1.5 release branch

            bschuchardt Bruce J Schuchardt added a comment - this is fixed in develop but needs to be ported to the 1.5 release branch

            Commit 3c65d9b296569cc2a769187881ab14c6fe3c9a93 in geode's branch refs/heads/feature/GEODE-4928b from bschuchardt
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=3c65d9b ]

            GEODE-4928 DistributedLockService doesn't work as expected while the dlock grantor is initialized

            Moved setting of the in-process state to a point before we launch the
            background thread to do the cleanup. Added in-line cleanup with a warning
            message if background execution is rejected and we aren't shutting down.

            If we are shutting down the in-process state is left being set to true
            so that lockers won't get a spurious conflict exception from the grantor
            and will retry their commits with a new grantor when one is created.

            jira-bot ASF subversion and git services added a comment - Commit 3c65d9b296569cc2a769187881ab14c6fe3c9a93 in geode's branch refs/heads/feature/ GEODE-4928 b from bschuchardt [ https://gitbox.apache.org/repos/asf?p=geode.git;h=3c65d9b ] GEODE-4928 DistributedLockService doesn't work as expected while the dlock grantor is initialized Moved setting of the in-process state to a point before we launch the background thread to do the cleanup. Added in-line cleanup with a warning message if background execution is rejected and we aren't shutting down. If we are shutting down the in-process state is left being set to true so that lockers won't get a spurious conflict exception from the grantor and will retry their commits with a new grantor when one is created.

            Commit 0ffb983821f50416d84ea3247cba70bb337438fe in geode's branch refs/heads/develop from bschuchardt
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=0ffb983 ]

            Revert "GEODE-4928 DistributedLockService doesn't work as expected while the dlock grantor is initialized"

            Reverting - this caused CI to hang.

            This reverts commit f3b47a5a8ba65fbcf0915e94da3ef3683962c43a.

            jira-bot ASF subversion and git services added a comment - Commit 0ffb983821f50416d84ea3247cba70bb337438fe in geode's branch refs/heads/develop from bschuchardt [ https://gitbox.apache.org/repos/asf?p=geode.git;h=0ffb983 ] Revert " GEODE-4928 DistributedLockService doesn't work as expected while the dlock grantor is initialized" Reverting - this caused CI to hang. This reverts commit f3b47a5a8ba65fbcf0915e94da3ef3683962c43a.

            Commit 9a350df0a0cbae2d222df9fa99faf5fd24a66fca in geode's branch refs/heads/feature/GEODE-4928c from bschuchardt
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=9a350df ]

            GEODE-4928 DistributedLockService doesn't work as expected while the dlock grantor is initialized

            Wait for background transaction cleanup when there is a failure.

            jira-bot ASF subversion and git services added a comment - Commit 9a350df0a0cbae2d222df9fa99faf5fd24a66fca in geode's branch refs/heads/feature/ GEODE-4928 c from bschuchardt [ https://gitbox.apache.org/repos/asf?p=geode.git;h=9a350df ] GEODE-4928 DistributedLockService doesn't work as expected while the dlock grantor is initialized Wait for background transaction cleanup when there is a failure.

            Commit b2f774363e53a4b522e3019afbb38ea5b72dcaf7 in geode's branch refs/heads/develop from bschuchardt
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=b2f7743 ]

            GEODE-4928 DistributedLockService doesn't work as expected while the dlock grantor is initialized

            Wait for background transaction cleanup when there is a failure.

            jira-bot ASF subversion and git services added a comment - Commit b2f774363e53a4b522e3019afbb38ea5b72dcaf7 in geode's branch refs/heads/develop from bschuchardt [ https://gitbox.apache.org/repos/asf?p=geode.git;h=b2f7743 ] GEODE-4928 DistributedLockService doesn't work as expected while the dlock grantor is initialized Wait for background transaction cleanup when there is a failure.

            Commit 020e5b56beba4a0dbd5a0529e39d4acb567fe52f in geode's branch refs/heads/develop from bschuchardt
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=020e5b5 ]

            GEODE-4928 DistributedLockService doesn't work as expected while the dlock grantor is initialized

            backing out test to fix connection issues

            jira-bot ASF subversion and git services added a comment - Commit 020e5b56beba4a0dbd5a0529e39d4acb567fe52f in geode's branch refs/heads/develop from bschuchardt [ https://gitbox.apache.org/repos/asf?p=geode.git;h=020e5b5 ] GEODE-4928 DistributedLockService doesn't work as expected while the dlock grantor is initialized backing out test to fix connection issues

            Commit 4bdd31cb322d65a2361e1dd4de2cbdee2a3caf60 in geode's branch refs/heads/develop from bschuchardt
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=4bdd31c ]

            GEODE-4928 DistributedLockService doesn't work as expected while the dlock grantor is initialized

            Fixing an issue with the test

            This closes #1680

            jira-bot ASF subversion and git services added a comment - Commit 4bdd31cb322d65a2361e1dd4de2cbdee2a3caf60 in geode's branch refs/heads/develop from bschuchardt [ https://gitbox.apache.org/repos/asf?p=geode.git;h=4bdd31c ] GEODE-4928 DistributedLockService doesn't work as expected while the dlock grantor is initialized Fixing an issue with the test This closes #1680

            Commit 020e5b56beba4a0dbd5a0529e39d4acb567fe52f in geode's branch refs/heads/feature/GEODE-4909 from bschuchardt
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=020e5b5 ]

            GEODE-4928 DistributedLockService doesn't work as expected while the dlock grantor is initialized

            backing out test to fix connection issues

            jira-bot ASF subversion and git services added a comment - Commit 020e5b56beba4a0dbd5a0529e39d4acb567fe52f in geode's branch refs/heads/feature/ GEODE-4909 from bschuchardt [ https://gitbox.apache.org/repos/asf?p=geode.git;h=020e5b5 ] GEODE-4928 DistributedLockService doesn't work as expected while the dlock grantor is initialized backing out test to fix connection issues

            Commit 4bdd31cb322d65a2361e1dd4de2cbdee2a3caf60 in geode's branch refs/heads/feature/GEODE-4909 from bschuchardt
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=4bdd31c ]

            GEODE-4928 DistributedLockService doesn't work as expected while the dlock grantor is initialized

            Fixing an issue with the test

            This closes #1680

            jira-bot ASF subversion and git services added a comment - Commit 4bdd31cb322d65a2361e1dd4de2cbdee2a3caf60 in geode's branch refs/heads/feature/ GEODE-4909 from bschuchardt [ https://gitbox.apache.org/repos/asf?p=geode.git;h=4bdd31c ] GEODE-4928 DistributedLockService doesn't work as expected while the dlock grantor is initialized Fixing an issue with the test This closes #1680
            upthewaterspout Dan Smith added a comment -

            The fix attempted for this doesn't quite work, and here's why:

            When a member/client leaves, the change causes both the transaction lock and DLock to clean up whatever locks that member had. DLocks clean up rather quickly, whereas transaction lock cleanup happens asynchronously in the background. The fix was to have the transaction lock grantor set some state letting the DLock grantor know that the transaction lock is cleaning up locks, and have the DLock grantor wait. However, the DLock grantor can get beyond that check without the Transaction grantor having notified, meaning that the window is reduced but not closed.

            In addition to the above problem, if the lock grantors are actually different members, making the dlock grantor wait on the transaction lock grantor in the same process doesn't help.

            The test written for this issue is demonstrating these issues sporadically, see GEODE-5470.

            I don't think we ever really designed these two services to work together that way, so I think this would be a feature to be implemented, not a bug to be fixed. Maybe we could have DLock and Transaction locks synchronize on a view number, rather than a boolean flag, but that requires new messages and significant reworking of how views are processed. Maybe we could mark certain DLock services as having to be synchronized with transaction locks?

            upthewaterspout Dan Smith added a comment - The fix attempted for this doesn't quite work, and here's why: When a member/client leaves, the change causes both the transaction lock and DLock to clean up whatever locks that member had. DLocks clean up rather quickly, whereas transaction lock cleanup happens asynchronously in the background. The fix was to have the transaction lock grantor set some state letting the DLock grantor know that the transaction lock is cleaning up locks, and have the DLock grantor wait. However, the DLock grantor can get beyond that check without the Transaction grantor having notified, meaning that the window is reduced but not closed. In addition to the above problem, if the lock grantors are actually different members, making the dlock grantor wait on the transaction lock grantor in the same process doesn't help. The test written for this issue is demonstrating these issues sporadically, see GEODE-5470 . I don't think we ever really designed these two services to work together that way, so I think this would be a feature to be implemented, not a bug to be fixed. Maybe we could have DLock and Transaction locks synchronize on a view number, rather than a boolean flag, but that requires new messages and significant reworking of how views are processed. Maybe we could mark certain DLock services as having to be synchronized with transaction locks?

            Commit fcd6940600bc0ad3593a492c9a3a30bef4a9bf76 in geode's branch refs/heads/develop from Dan Smith
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=fcd6940 ]

            GEODE-5470: Deleting flaky DlockAndTxLockRegressionTest

            This test was introduced as part of the fix for GEODE-4928. However,
            we've determined that GEODE-4928 is not actually fixed, which is why
            this test is sporadically failing.

            Since GEODE-4928 actually appears to be trying to introduce new behavior
            for dlocks that is a rather involved feature to implement. Leaving that
            feature for another time and deleting this test.

            jira-bot ASF subversion and git services added a comment - Commit fcd6940600bc0ad3593a492c9a3a30bef4a9bf76 in geode's branch refs/heads/develop from Dan Smith [ https://gitbox.apache.org/repos/asf?p=geode.git;h=fcd6940 ] GEODE-5470 : Deleting flaky DlockAndTxLockRegressionTest This test was introduced as part of the fix for GEODE-4928 . However, we've determined that GEODE-4928 is not actually fixed, which is why this test is sporadically failing. Since GEODE-4928 actually appears to be trying to introduce new behavior for dlocks that is a rather involved feature to implement. Leaving that feature for another time and deleting this test.

            Commit fcd6940600bc0ad3593a492c9a3a30bef4a9bf76 in geode's branch refs/heads/develop from Dan Smith
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=fcd6940 ]

            GEODE-5470: Deleting flaky DlockAndTxLockRegressionTest

            This test was introduced as part of the fix for GEODE-4928. However,
            we've determined that GEODE-4928 is not actually fixed, which is why
            this test is sporadically failing.

            Since GEODE-4928 actually appears to be trying to introduce new behavior
            for dlocks that is a rather involved feature to implement. Leaving that
            feature for another time and deleting this test.

            jira-bot ASF subversion and git services added a comment - Commit fcd6940600bc0ad3593a492c9a3a30bef4a9bf76 in geode's branch refs/heads/develop from Dan Smith [ https://gitbox.apache.org/repos/asf?p=geode.git;h=fcd6940 ] GEODE-5470 : Deleting flaky DlockAndTxLockRegressionTest This test was introduced as part of the fix for GEODE-4928 . However, we've determined that GEODE-4928 is not actually fixed, which is why this test is sporadically failing. Since GEODE-4928 actually appears to be trying to introduce new behavior for dlocks that is a rather involved feature to implement. Leaving that feature for another time and deleting this test.

            Commit fcd6940600bc0ad3593a492c9a3a30bef4a9bf76 in geode's branch refs/heads/develop from Dan Smith
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=fcd6940 ]

            GEODE-5470: Deleting flaky DlockAndTxLockRegressionTest

            This test was introduced as part of the fix for GEODE-4928. However,
            we've determined that GEODE-4928 is not actually fixed, which is why
            this test is sporadically failing.

            Since GEODE-4928 actually appears to be trying to introduce new behavior
            for dlocks that is a rather involved feature to implement. Leaving that
            feature for another time and deleting this test.

            jira-bot ASF subversion and git services added a comment - Commit fcd6940600bc0ad3593a492c9a3a30bef4a9bf76 in geode's branch refs/heads/develop from Dan Smith [ https://gitbox.apache.org/repos/asf?p=geode.git;h=fcd6940 ] GEODE-5470 : Deleting flaky DlockAndTxLockRegressionTest This test was introduced as part of the fix for GEODE-4928 . However, we've determined that GEODE-4928 is not actually fixed, which is why this test is sporadically failing. Since GEODE-4928 actually appears to be trying to introduce new behavior for dlocks that is a rather involved feature to implement. Leaving that feature for another time and deleting this test.

            People

              Unassigned Unassigned
              bschuchardt Bruce J Schuchardt
              Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 2.5h
                  2.5h