Uploaded image for project: 'Ignite'
  1. Ignite
  2. IGNITE-13051

Optimized affinity switch on baseline node left is broken for client topology and MVCC

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • None
    • 2.9
    • None
    • None
    • Docs Required, Release Notes Required

    Description

      If a node contains only client cache topology with MVCC enabled, PME will hang after changes introduced in IGNITE-12617.

      Reproduced by CachePartitionLossWithRestartsTest.testPartitionLossDetectionOnClientTopology[0 false false -1] and enabled MVCC.

      Attachments

        Issue Links

          Activity

            ascherbakov,
            Seems, ctx.kernalContext().coordinators().mvccEnabled() provides wrong result in this case.
            Do we able to gain mvcc state on client topology somehow?

            avinogradov Anton Vinogradov (Obsolete, actual is "av") added a comment - ascherbakov , Seems, ctx.kernalContext().coordinators().mvccEnabled() provides wrong result in this case. Do we able to gain mvcc state on client topology somehow?

            avinogradov

            I suppose [1] should be set to true if a client topology is created on a node.

            [1] org.apache.ignite.internal.processors.cache.mvcc.MvccProcessorImpl#mvccEnabled

            ascherbakov Alexey Scherbakov added a comment - avinogradov I suppose [1] should be set to true if a client topology is created on a node. [1] org.apache.ignite.internal.processors.cache.mvcc.MvccProcessorImpl#mvccEnabled
            timonin.maksim Maksim Timonin added a comment - - edited

            ascherbakov hi! I've made some research and have a question about the test case [0 false false -1]. In this case:

            1. 4 server nodes started;
            2. 3 of them (id is in 1,2,3) configured with IgniteConfiguration with invoked setCacheConfiguration (default cache);
            3. Node with id=0 configured with IgniteConfiguration without invocation setCacheConfiguration;

            It looks like the id=0 node is filtered by the fact that the cache is not configured at all, not by AttributeNodeFilter. Is it a valid case? 

            Please check the code line, it is not invoked for the id=0 node: CachePartitionLossWithRestartsTest.java#L135

            After configuring the cache on the id=0 node the test case passed.

             

             

             

            timonin.maksim Maksim Timonin added a comment - - edited ascherbakov  hi! I've made some research and have a question about the test case [0 false false -1] . In this case: 4 server nodes started; 3 of them (id is in 1,2,3) configured with IgniteConfiguration with invoked setCacheConfiguration ( default cache); Node with id=0 configured with IgniteConfiguration without  invocation setCacheConfiguration; It looks like the id=0 node is filtered by the fact that the cache is not configured at all, not by AttributeNodeFilter. Is it a valid case?  Please check the code line, it is not invoked for the id=0 node:  CachePartitionLossWithRestartsTest.java#L135 After configuring the cache on the id=0 node the test case passed.      

            timonin.maksim

            This configuration causes cache to be started in client mode on a node0.

            ascherbakov Alexey Scherbakov added a comment - timonin.maksim This configuration causes cache to be started in client mode on a node0.
            timonin.maksim Maksim Timonin added a comment - - edited

            ascherbakov ok, thanks for the clarification.

            gvvinblade hi! could you please check my fix suggestion in PR GitHub Pull Request #7852. Is it a right place to fix? My logic is:

            1. By documentation [1] mvccEnabled is true whether at least one cache with mvcc registered;
            2. Caches are registered in CachesRegistry [2] without applying AttributeNodeFilter;
            3. MVCC flag set to true for new caches in next cases:
              1. preProcessCacheConfiguration: while dynamic starting cache;
              2. preProcessCacheConfiguration: adding cache from configuration;
              3. starting cache after triggering GlobalState to active.
            4. This bug describes a case:
              1. the cache is started on other nodes in MVCC mode;
              2. the cache would not be started on the node0 as it filtered by AttributeNodeFilter;
              3. the cache is not added to configuration for the node0 IgniteConfiguration;
              4. so it leads that cache is registered on the node0, but not started and the MVCC flag is not set. It leads to inconsistency of cache mode on different nodes.

            So I found a place to handle those nodes (PR fix), it already contains a handle initQueryStructuresForNotStartedCache. So it looks like a right place to add the MVCC handle there. But there are some questions:

            1. Is it valid to use validateCacheConfiguration because by docs [3] it validates caches before start but the cache won't be started?
            2. If MVCC enabled for registered caches, maybe it's better to check the MVCC configuration while cache registration process?

             

            [1] MvccProcessorImpl#L216

            [2] CachesRegistry.java#L52

            [3] MvccProcessor.java#L211

            timonin.maksim Maksim Timonin added a comment - - edited ascherbakov  ok, thanks for the clarification. gvvinblade  hi! could you please check my fix suggestion in PR GitHub Pull Request #7852 . Is it a right place to fix? My logic is: By documentation [1]  mvccEnabled is true whether at least one cache with mvcc registered ; Caches are registered in CachesRegistry [2] without applying AttributeNodeFilter; MVCC flag set to true for new caches in next cases: preProcessCacheConfiguration: while dynamic starting cache; preProcessCacheConfiguration: adding cache from configuration; starting cache after triggering GlobalState to active. This bug describes a case: the cache is started on other nodes in MVCC mode; the cache would not be started on the node0 as it filtered by AttributeNodeFilter; the cache is not added to configuration for the node0 IgniteConfiguration; so it leads that cache is registered on the node0, but not started and the MVCC flag is not set. It leads to inconsistency of cache mode on different nodes. So I found a place to handle those nodes ( PR fix ), it already contains a handle initQueryStructuresForNotStartedCache.  So it looks like a right place to add the MVCC handle there. But there are some questions: Is it valid to use validateCacheConfiguration because by docs [3] it validates caches before start but the cache won't be started? If MVCC enabled for registered caches, maybe it's better to check the MVCC configuration while cache registration process?   [1] MvccProcessorImpl#L216 [2] CachesRegistry.java#L52 [3] MvccProcessor.java#L211
            timonin.maksim Maksim Timonin added a comment -

            Discussed with gvvinblade, additional checks are required.

            timonin.maksim Maksim Timonin added a comment - Discussed with gvvinblade , additional checks are required.

            timonin.maksim, Looks good to me, at least I see no issues.

            gvvinblade Igor Seliverstov added a comment - timonin.maksim , Looks good to me, at least I see no issues.
            timonin.maksim Maksim Timonin added a comment - - edited

            gvvinblade hi! thanks for the review!

            ascherbakov hi! Could you please check the PR?

            timonin.maksim Maksim Timonin added a comment - - edited gvvinblade hi! thanks for the review! ascherbakov hi! Could you please check the PR?
            ignitetcbot Ignite TC Bot added a comment -
            Branch: [pull/7852/head] Base: [master] : No blockers found!

            TeamCity --> Run :: All Results

            ignitetcbot Ignite TC Bot added a comment - Branch: [pull/7852/head] Base: [master] : No blockers found! TeamCity --> Run :: All Results

            Merget to master.
            Thanks for your contribution.

            avinogradov Anton Vinogradov (Obsolete, actual is "av") added a comment - Merget to master. Thanks for your contribution.

            avinogradov timonin.maksim

            The fix is incorrect, it works only if a node has joined.

            The test mentioned in the ticket desciption still hangs.

            I've reopened the ticket.

            ascherbakov Alexey Scherbakov added a comment - avinogradov timonin.maksim The fix is incorrect, it works only if a node has joined. The test mentioned in the ticket desciption still hangs. I've reopened the ticket.
            timonin.maksim Maksim Timonin added a comment - - edited

            ascherbakov could you please provide commit id where tests are failing. They're working on my machine with last version of branch ignite-13003 and cherry-picked my commit

            check attachments - there are pictures from my laptop

            timonin.maksim Maksim Timonin added a comment - - edited ascherbakov could you please provide commit id where tests are failing. They're working on my machine with last version of branch ignite-13003 and cherry-picked my commit check attachments - there are pictures from my laptop

            timonin.maksim

            The failing test scenario was mentioned in the ticket description.
            It has to be added to the code base. I've provided a patch reproducer.patch .

            ascherbakov Alexey Scherbakov added a comment - timonin.maksim The failing test scenario was mentioned in the ticket description. It has to be added to the code base. I've provided a patch reproducer.patch .
            timonin.maksim Maksim Timonin added a comment -

            ascherbakov

            fair enough. Miss the setting while final testing. Sorry for that.

            timonin.maksim Maksim Timonin added a comment - ascherbakov fair enough. Miss the setting while final testing. Sorry for that.
            timonin.maksim Maksim Timonin added a comment -

            ascherbakov hi! Could you please check my new PR. I moved validation check to another place that guarantees to cover all possible cases.

            timonin.maksim Maksim Timonin added a comment - ascherbakov  hi! Could you please check my new PR. I moved validation check to another place that guarantees to cover all possible cases.

            timonin.maksim
            The change looks valid, but I see no new bot visa, and aforementioned test scenario still not running.
            I would also add more test scenarios:
            1. activation and subsequent node left with client cache.
            2. new node join with new client cache with subsequent node left.
            3. dynamic cache start with client cache on some nodes with subsequent node left.

            ascherbakov Alexey Scherbakov added a comment - timonin.maksim The change looks valid, but I see no new bot visa, and aforementioned test scenario still not running. I would also add more test scenarios: 1. activation and subsequent node left with client cache. 2. new node join with new client cache with subsequent node left. 3. dynamic cache start with client cache on some nodes with subsequent node left.
            ignitetcbot Ignite TC Bot added a comment -
            Branch: [pull/7902/head] Base: [master] : No blockers found!

            TeamCity --> Run :: All Results

            ignitetcbot Ignite TC Bot added a comment - Branch: [pull/7902/head] Base: [master] : No blockers found! TeamCity --> Run :: All Results
            timonin.maksim Maksim Timonin added a comment -

            ascherbakov hi! Thanks for the suggestions! I added to tests cases you described and some more.

            timonin.maksim Maksim Timonin added a comment - ascherbakov  hi! Thanks for the suggestions! I added to tests cases you described and some more.
            timonin.maksim Maksim Timonin added a comment -

            ascherbakov hi! did you have a chance to check the PR?

            timonin.maksim Maksim Timonin added a comment - ascherbakov  hi! did you have a chance to check the PR?

            timonin.maksim

            1. Added tests do not cover the scenario introduced in testPartitionLossDetectionOnClientTopology causing PME hang. You need to enable free switch exchange to execute problematic code branch (baseline node left).
            2. The fix seems logically incorrect. The method org.apache.ignite.internal.processors.cache.mvcc.MvccProcessor#validateCacheConfiguration is intended to do some validation before cache start according to java doc and throws an exception if validation has failed. After your change validation result is totally ignored and is performed after caches start. Most probably mvcc flag must be initialized elsewhere.
            3. registerCachesFuture must be reverted back to registerCachesFut according to code style.
            ascherbakov Alexey Scherbakov added a comment - timonin.maksim Added tests do not cover the scenario introduced in testPartitionLossDetectionOnClientTopology causing PME hang. You need to enable free switch exchange to execute problematic code branch (baseline node left). The fix seems logically incorrect. The method org.apache.ignite.internal.processors.cache.mvcc.MvccProcessor#validateCacheConfiguration is intended to do some validation before cache start according to java doc and throws an exception if validation has failed. After your change validation result is totally ignored and is performed after caches start. Most probably mvcc flag must be initialized elsewhere. registerCachesFuture must be reverted back to registerCachesFut according to code style.
            timonin.maksim Maksim Timonin added a comment -

            ascherbakov  hi! thanks for comments!

            1. This is not Free Switch Exchange specific bug. The reason is that MVCC flag is not set correctly in 3 places (my test cases cover that):
              1. when client topology is not configured with mvcc CacheConfiguration;
              2. when cluster_state_change event appeared for client topology;
              3. when cache joined to client topology after node already started. 

                      But I added mvcc flag in test that you mention in description.

                 2. Yes, this is a point. The reason I did it is an attempt to fix flag setting for all possible cases. But I think you're correct here and split this change on 3 places as I described above;

                 3.  Unfortunately naming Future was introduced before me, I can't fix without breaking code ownership. I'll remove all my code from this place and somebody could fix the naming in future in any related tasks.
             
             

            timonin.maksim Maksim Timonin added a comment - ascherbakov  hi! thanks for comments! This is not Free Switch Exchange specific bug. The reason is that MVCC flag is not set correctly in 3 places (my test cases cover that): when client topology is not configured with mvcc CacheConfiguration; when cluster_state_change event appeared for client topology; when cache joined to client topology after node already started.            But I added mvcc flag in test that you mention in description.      2. Yes, this is a point. The reason I did it is an attempt to fix flag setting for all possible cases. But I think you're correct here and split this change on 3 places as I described above;      3.  Unfortunately naming Future was introduced before me, I can't fix without breaking code ownership. I'll remove all my code from this place and somebody could fix the naming in future in any related tasks.    
            ignitetcbot Ignite TC Bot added a comment -
            Branch: [pull/7902/head] Base: [master] : No blockers found!

            TeamCity --> Run :: All Results

            ignitetcbot Ignite TC Bot added a comment - Branch: [pull/7902/head] Base: [master] : No blockers found! TeamCity --> Run :: All Results
            timonin.maksim Maksim Timonin added a comment -

            ascherbakov hi! Tests passed on TC (visa is there)

            timonin.maksim Maksim Timonin added a comment - ascherbakov  hi! Tests passed on TC (visa is there)
            timonin.maksim Maksim Timonin added a comment -

            ascherbakov hi! could you please check updated PR?

            timonin.maksim Maksim Timonin added a comment - ascherbakov  hi! could you please check updated PR?
            ascherbakov Alexey Scherbakov added a comment - timonin.maksim LGTM.

            People

              timonin.maksim Maksim Timonin
              ascherbakov Alexey Scherbakov
              Alexey Scherbakov Alexey Scherbakov
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 2h 40m
                  2h 40m