[IGNITE-10078] Node failure during concurrent partition updates may cause partition desync between primary and backup. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.8
Component/s: None
Labels:
None

Ignite Flags:

Docs Required

Description

This can happen when some updates are not written to WAL before a node fails. After the node returns, rebalancing does not apply the missing updates because the partition counters on primary and backup are the same. A scenario that reproduces it:

1. Start grid with 3 nodes, 2 backups.
2. Preload some data to partition P.
3. Start two concurrent transactions, each writing a single, different key to the same partition P:

try (Transaction tx = client.transactions().txStart(PESSIMISTIC, REPEATABLE_READ, 0, 1)) {
    // Each transaction uses its own key k; both keys map to partition P.
    client.cache(DEFAULT_CACHE_NAME).put(k, v);

    tx.commit();
}

4. Order the updates on a backup so that the update with the higher partition counter is written to WAL, while the update with the lower partition counter fails because the failure handler (FH) is triggered before the update is added to WAL.

5. Return the failed node to the grid and observe that no rebalancing happens, because the partition counters are the same.
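The effect of the steps above can be illustrated with a simplified model. This is only a sketch, not Ignite code: the class `CounterDesyncDemo`, its methods, and the idea of treating a partition copy as a map from update counter to key are all assumptions made for illustration. It assumes the rebalance decision compares only the highest counter of each copy.

```java
import java.util.TreeMap;

/** Simplified model, not Ignite code: a partition copy maps update counter to key. */
final class CounterDesyncDemo {
    /** Highest update counter present in the copy; 0 for an empty copy. */
    static long highestCounter(TreeMap<Long, String> copy) {
        return copy.isEmpty() ? 0L : copy.lastKey();
    }

    /** Naive check (an assumption of this sketch): rebalance only if highest counters differ. */
    static boolean needsRebalance(TreeMap<Long, String> primary, TreeMap<Long, String> backup) {
        return highestCounter(primary) != highestCounter(backup);
    }

    /** Builds a copy holding preloaded updates with counters 1..3. */
    static TreeMap<Long, String> preloaded() {
        TreeMap<Long, String> copy = new TreeMap<>();
        for (long c = 1; c <= 3; c++)
            copy.put(c, "preload-" + c);
        return copy;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> primary = preloaded();
        TreeMap<Long, String> backup = preloaded();

        // Two concurrent transactions receive counters 4 and 5 on the primary.
        primary.put(4L, "k1");
        primary.put(5L, "k2");

        // On the backup only update 5 reaches the WAL; update 4 is lost with the node failure.
        backup.put(5L, "k2");

        System.out.println("needsRebalance = " + needsRebalance(primary, backup));
        System.out.println("copies equal   = " + primary.equals(backup));
    }
}
```

In this model the naive check reports that no rebalance is needed although the backup is missing the update with counter 4, which is the desync described in this issue.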

Possible solution: detect gaps in update counters on recovery and force rebalance from a node without gaps if detected.
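A minimal sketch of the gap detection idea, assuming the counters of updates recovered from WAL are available as a sorted set. The class `UpdateCounterGaps` and its methods are hypothetical names for illustration and do not reflect the implementation merged for this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper, not Ignite code: finds missing update counters after recovery. */
final class UpdateCounterGaps {
    /**
     * Returns counters missing from the range (initial, highest recovered counter].
     *
     * @param initial   counter value known to be fully applied before the recovered updates
     * @param recovered counters of the updates found in WAL during recovery
     */
    static List<Long> findGaps(long initial, TreeSet<Long> recovered) {
        List<Long> gaps = new ArrayList<>();
        long expected = initial + 1;

        for (long c : recovered) {
            if (c <= initial)
                continue; // Already covered by the initial counter.

            // Every counter between the expected one and c was never logged.
            for (; expected < c; expected++)
                gaps.add(expected);

            expected = c + 1;
        }

        return gaps;
    }

    /** A node with gaps should not be trusted and should be rebalanced from a node without gaps. */
    static boolean shouldForceRebalance(long initial, TreeSet<Long> recovered) {
        return !findGaps(initial, recovered).isEmpty();
    }

    public static void main(String[] args) {
        TreeSet<Long> recovered = new TreeSet<>();
        recovered.add(5L); // Update 4 never reached the WAL.

        System.out.println("gaps = " + findGaps(3L, recovered));
        System.out.println("force rebalance = " + shouldForceRebalance(3L, recovered));
    }
}
```

With an initial counter of 3 and only update 5 recovered, the sketch reports counter 4 as a gap, so the node would request rebalancing even though its highest counter matches the primary.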

Issue Links

causes

IGNITE-11867 Fix flaky test GridCacheRebalancingWithAsyncClearingTest#testCorrectRebalancingCurrentlyRentingPartitions

Resolved

IGNITE-11908 OOM in MVCC PDS4

Resolved

IGNITE-11857 Investigate performance drop after IGNITE-10078

Resolved

is related to

IGNITE-10603 MVCC: Inconsistent partition state after recovery.

Resolved

is required by

IGNITE-11794 Remove initial counter from update counter contract.

Open

IGNITE-11797 Fix consistency issues for atomic and mixed tx-atomic cache groups.

Resolved

IGNITE-11611 If partition consistency cannot be restored during rebalance using counters the most recent partition data should be used.

Open

IGNITE-11793 Improve isolated updater mode.

Open

IGNITE-11820 Add persistence to IgniteCacheGroupTest

Open

IGNITE-11799 Do not always clear partition in MOVING state before exchange

Resolved

relates to

IGNITE-11800 Update counters in o.a.i.i.processors.cache.distributed.dht.topology.GridDhtPartitionTopologyImpl#update could be applied from stale messages

Open

IGNITE-11801 Clearing of moving partition may lead to partition desync.

Open

IGNITE-12694 A possible partition desync if last supplier has left and returned later.

Open

IGNITE-11704 Write tombstones during rebalance to get rid of deferred delete buffer

Open

IGNITE-11790 Optimize rebalance history calculation.

Open

IGNITE-11791 Fix IgnitePdsContinuousRestartTestWithExpiryPolicy

Open

IGNITE-11887 Add more test scenarious for OWNING -> RENTING -> MOVING scenario

Open

IGNITE-11147 Re-balance cancellation occur by non-affected event

Resolved

links to

GitHub Pull Request #5765

People

Assignee: Alexey Scherbakov

Reporter: Alexey Scherbakov

Votes: 1

Watchers: 9

Dates

Created: 31/Oct/18 07:25

Updated: 18/Feb/20 10:46

Resolved: 22/May/19 16:26

Time Tracking

Estimated: Not Specified