Description
When the DocumentNodeStore starts up, it attempts to find an existing cluster node entry that matches the current instance (identified by a machine ID based on a network interface address and an instance ID based on the current working directory).
However, there is an additional check: when the entry's cluster lease end time has not been reached yet, the entry is skipped (on the assumption that it belongs to a different, still running instance) and the scan continues. When no other entry is found, a new one is created.
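For illustration, here is a minimal sketch of the lookup described above, assuming a single store with one entry per cluster node; the names (ClusterEntry, acquireClusterId) are hypothetical and deliberately simplified, not the actual ClusterNodeInfo code:

{code:java}
import java.util.List;

// Hypothetical, simplified model of a cluster node entry.
class ClusterEntry {
    int clusterId;
    String machineId;   // e.g. derived from a network interface address
    String instanceId;  // e.g. the current working directory
    long leaseEndTime;  // millis since epoch
}

class ClusterIdAssignmentSketch {

    /** Picks a cluster node id for the given machine/instance pair. */
    static int acquireClusterId(List<ClusterEntry> entries,
                                String machineId, String instanceId, long now) {
        int maxId = 0;
        for (ClusterEntry e : entries) {
            maxId = Math.max(maxId, e.clusterId);
            if (!e.machineId.equals(machineId) || !e.instanceId.equals(instanceId)) {
                continue; // entry belongs to a different instance
            }
            if (e.leaseEndTime > now) {
                // Lease not expired yet: the current logic skips the entry,
                // assuming it belongs to some other, still running instance.
                continue;
            }
            return e.clusterId; // reuse the matching entry with an expired lease
        }
        // No usable entry found: create a new one with the next free id.
        ClusterEntry fresh = new ClusterEntry();
        fresh.clusterId = maxId + 1;
        fresh.machineId = machineId;
        fresh.instanceId = instanceId;
        fresh.leaseEndTime = now + 120_000; // assuming a 120s lease here
        entries.add(fresh);
        return fresh.clusterId;
    }
}
{code}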
So why would we ever consider instances with matching instance information to be different? As far as I can tell the answer is: for unit testing.
But...
With the current assignment logic, very strange things can happen, and I believe I am seeing exactly this in a customer problem I am investigating. The sequence is:
1) First system startup, cluster node id 1 is assigned
2) System crashes (or is killed)
3) System restarts within the lease time (120s?); because the existing entry's lease has not expired, a new cluster node id is assigned
4) System shuts down cleanly and is restarted after a longer interval: cluster node id 1 is used again, and the system runs MissingLastRevRecovery, despite the previous shutdown having been clean
So what we see is that the system starts up with varying cluster node ids, and recovery processes may run with no correlation to what happened before.
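To make this concrete, here is a small driver that pushes the sketch above through exactly these four steps; the machine id, working directory, timings and printed output are purely illustrative:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Replays the four-step scenario against the acquireClusterId sketch above.
class RestartScenarioSketch {
    public static void main(String[] args) {
        List<ClusterEntry> entries = new ArrayList<>();
        String machine = "mac-00-11-22-33-44-55";      // hypothetical machine id
        String instance = "/opt/app/current-work-dir"; // hypothetical working dir
        long t = 0;

        // 1) First startup: no entries yet, id 1 is created (lease until t + 120s).
        int first = ClusterIdAssignmentSketch.acquireClusterId(entries, machine, instance, t);

        // 2) Crash: the lease is never released.

        // 3) Restart 30s later, within the lease: the matching entry is skipped
        //    because its lease has not expired, so a new id is created.
        t += 30_000;
        int second = ClusterIdAssignmentSketch.acquireClusterId(entries, machine, instance, t);

        // 4) Clean shutdown, restart much later: all leases have expired, the scan
        //    returns id 1 again; in the real system this is where
        //    MissingLastRevRecovery kicks in for id 1, even though the most recent
        //    shutdown (under the second id) was clean.
        t += 600_000;
        int third = ClusterIdAssignmentSketch.acquireClusterId(entries, machine, instance, t);

        System.out.printf("startup ids: %d, %d, %d%n", first, second, third);
        // prints: startup ids: 1, 2, 1
    }
}
{code}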
Proposal:
a) Make ClusterNodeInfo.createInstance() much more verbose, so that the default system log contains sufficient information to understand why a certain cluster node id was picked (see the logging sketch after this list).
b) Drop the logic that skips entries with non-expired leases, so that we get a one-to-one relation between instance ids and cluster node ids. For the unit tests that currently rely on this logic, switch to APIs where the test setup picks the cluster node id.
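As a rough illustration of proposal a), here is a hypothetical variant of the lookup sketch above that logs every decision at INFO level (skip on mismatch, skip on unexpired lease, reuse, create); the messages are made up for illustration and are not actual Oak log output:

{code:java}
import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Same lookup as the sketch above, trimmed to highlight the proposed logging.
class VerboseClusterIdLoggingSketch {

    private static final Logger LOG =
            LoggerFactory.getLogger(VerboseClusterIdLoggingSketch.class);

    static int acquireClusterId(List<ClusterEntry> entries,
                                String machineId, String instanceId, long now) {
        int maxId = 0;
        for (ClusterEntry e : entries) {
            maxId = Math.max(maxId, e.clusterId);
            if (!e.machineId.equals(machineId) || !e.instanceId.equals(instanceId)) {
                LOG.info("Skipping cluster id {}: machine/instance id {}/{} does not match this instance ({}/{})",
                        e.clusterId, e.machineId, e.instanceId, machineId, instanceId);
                continue;
            }
            if (e.leaseEndTime > now) {
                LOG.info("Skipping cluster id {}: matches this instance, but lease only expires at {} (now: {})",
                        e.clusterId, e.leaseEndTime, now);
                continue;
            }
            LOG.info("Reusing cluster id {} for machine {} / instance {}",
                    e.clusterId, machineId, instanceId);
            return e.clusterId;
        }
        LOG.info("No reusable entry found, creating new cluster id {} for machine {} / instance {}",
                maxId + 1, machineId, instanceId);
        return maxId + 1;
    }
}
{code}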