[OAK-3436] Prevent missing checkpoint due to unstable topology from causing complete reindexing - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.0.32, 1.2.10, 1.3.13, 1.4
Component/s: query
Labels:
- resilience

Epic Link:
indexer resilience 1.6

Description

The async indexing logic relies on the embedding application to ensure that the async indexing job runs as a singleton in a cluster. For Sling-based applications this depends on Sling Discovery support. When the topology is not stable, different cluster nodes have at times each considered themselves the leader and executed the async indexing job concurrently.

This can cause problems because the cluster nodes might not see the same repository state (due to write skew and eventual consistency), and one node might remove a checkpoint that another node still relies upon. For example, consider a 2-node cluster, N1 and N2, where both nodes are performing async indexing.

Base state - CP1 is the checkpoint for the "async" job.
N2 starts indexing and moves the checkpoint from CP1 to CP2, removing CP1. For Mongo, the checkpoints are saved in the settings collection.
N1 also decides to execute indexing but has not yet seen the latest repository state, so it still thinks that CP1 is the base checkpoint and tries to read it. However, CP1 has already been removed from settings, so N1 concludes that the checkpoint is missing and decides to reindex everything!
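
The sequence above can be reproduced with a toy model. This is only an illustrative sketch, not Oak code: the class `CheckpointRaceDemo`, its methods, and the string results are hypothetical names, and a plain in-memory set stands in for the shared checkpoint storage.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the race described above. Not Oak code; all names are hypothetical. */
public class CheckpointRaceDemo {
    // Stand-in for the shared checkpoint storage (e.g. the Mongo settings collection).
    private final Set<String> checkpoints = new HashSet<>();

    public void create(String checkpoint) {
        checkpoints.add(checkpoint);
    }

    /** An indexing run completes: the new checkpoint is stored and the old one removed. */
    public void advance(String oldCheckpoint, String newCheckpoint) {
        checkpoints.add(newCheckpoint);
        checkpoints.remove(oldCheckpoint);
    }

    /** Decision a node takes based on the base checkpoint it believes is current. */
    public String planRun(String baseCheckpointSeenByNode) {
        return checkpoints.contains(baseCheckpointSeenByNode) ? "incremental" : "full-reindex";
    }

    public static void main(String[] args) {
        CheckpointRaceDemo store = new CheckpointRaceDemo();
        store.create("CP1");            // base state
        String n1View = "CP1";          // N1 reads the base checkpoint, then lags behind
        store.advance("CP1", "CP2");    // N2 indexes and moves CP1 to CP2
        System.out.println("N1 plans: " + store.planRun(n1View));
        System.out.println("N2 plans: " + store.planRun("CP2"));
    }
}
```

In this model N1, working from its stale view, cannot find CP1 and falls back to a full reindex, while N2 continues incrementally from CP2.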

To avoid this, the topology must be stable, but at the Oak level we should still handle such a case and avoid doing a full reindex. So we would need a MissingCheckpointStrategy, similar to the MissingIndexEditorStrategy introduced in OAK-2203.

Possible approaches

A1 - Fail the indexing run if the checkpoint is missing. A checkpoint can go missing for valid and for invalid reasons, so we need to determine the valid scenarios in which a checkpoint can go missing.
A2 - When a checkpoint is created, also store its creation time. When a checkpoint is found to be missing and it is a recent checkpoint, fail the run. For example, we would fail the run as long as the missing checkpoint is less than an hour old (for a node that has just started, take the startup time into account).
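
Approach A2 could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation that was committed: `RecentCheckpointGuard`, `Decision`, and `onMissingCheckpoint` are hypothetical names, and the handling of startup time (measuring age from the later of checkpoint creation and node startup) is one possible reading of the note above.

```java
import java.time.Duration;

/** Sketch of approach A2. Not Oak code; all names are hypothetical. */
public class RecentCheckpointGuard {

    public enum Decision { FAIL_RUN, REINDEX }

    private final long graceWindowMillis;
    private final long startupTimeMillis;

    public RecentCheckpointGuard(Duration graceWindow, long startupTimeMillis) {
        this.graceWindowMillis = graceWindow.toMillis();
        this.startupTimeMillis = startupTimeMillis;
    }

    /**
     * Called when the base checkpoint cannot be found.
     *
     * @param checkpointCreatedMillis creation time stored alongside the checkpoint
     * @param nowMillis               current time
     */
    public Decision onMissingCheckpoint(long checkpointCreatedMillis, long nowMillis) {
        // Assumption: a node that has just started gets the same grace period,
        // so the age is measured from whichever is later, creation or startup.
        long reference = Math.max(checkpointCreatedMillis, startupTimeMillis);
        long age = nowMillis - reference;
        // A recent checkpoint going missing is suspicious (e.g. concurrent runs),
        // so fail this run instead of reindexing everything.
        return age < graceWindowMillis ? Decision.FAIL_RUN : Decision.REINDEX;
    }
}
```

With a one-hour window, a checkpoint created a minute ago that has gone missing would fail the run, while one created several hours ago on a long-running node would still allow a reindex.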

Attachments

- AsyncIndexUpdateClusterTest.java, 24/Sep/15 15:41, 8 kB, Chetan Mehrotra
- OAK-3436-0.patch, 25/Sep/15 10:10, 15 kB, Chetan Mehrotra
- OAK-3436-part2.patch, 08/Dec/15 14:17, 4 kB, Alex Deparvu
- OAK-3436-part2-v2.patch, 09/Dec/15 09:04, 4 kB, Alex Deparvu
- OAK-3436-tests.patch, 07/Dec/15 12:35, 23 kB, Alex Deparvu
- OAK-3436-v2.patch, 07/Dec/15 17:04, 26 kB, Alex Deparvu

Issue Links

is blocked by

- OAK-1648 Creating multiple checkpoint on same head revision overwrites previous entries (Closed)

is duplicated by

- OAK-3810 Log messages related to AsyncIndexUpdate leaseTimeOut impact (Resolved)

relates to

- OAK-2961 Async index fails with OakState0001: Unresolved conflicts in /:async (Closed)
- OAK-3891 AsyncIndexUpdateLeaseTest doesn't use the provided NodeStore (Closed)

People

Assignee: Alex Deparvu
Reporter: Chetan Mehrotra
Votes: 15
Watchers: 10

Dates

Created: 22/Sep/15 04:49
Updated: 04/Jul/16 10:58
Resolved: 18/Dec/15 16:10