[GEODE-3003] Geode doesn't start after cluster restart when using cluster-configuration - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.5.0
Component/s: configuration, membership
Labels: None

Description

The setup is a two-host Geode cluster with one locator and one server on each host.
The first start of all nodes goes well.
Then all nodes are stopped gracefully (kill [locator-PID] [server-PID]).
The second start goes wrong: the locator on the first host never joins the rest of the cluster, and its log contains the error:
"Region /_ConfigurationRegion has potentially stale data. It is waiting for another member to recover the latest data."

Sometimes (about once in 5 starts) one of the servers also shuts down right after start with the error
"org.apache.geode.GemFireConfigException: cluster configuration service not available".

This bug first appeared when we moved to Geode 1.1.1, and it blocks us completely.
GemFire 8.2.1 did not have this bug.

This is very easy to reproduce.

Test preparation:
---------------------
Two zip files are attached: "geode-host1.zip" and "geode-host2.zip".
1) Unzip "geode-host1.zip" into some folder on your first host.
2) In start-locator.sh, change the locator IPs to those of your host1 and host2:
"--locators=10.50.3.38[20236],10.50.3.14[20236]"
3) In start-server.sh, change the locator IPs to those of your host1 and host2:
"locators=10.50.3.38[20236],10.50.3.14[20236]"
4) Repeat steps 1)-3) on host2 using "geode-host2.zip". The folder you unzip into must have the same path as on the first host.
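
Steps 2) and 3) can be scripted. The following is a minimal sketch, not part of the attached files: the function name set_locator_ips is hypothetical, and it assumes the scripts still contain the sample addresses 10.50.3.38 and 10.50.3.14 quoted above.

```shell
#!/bin/sh
# Hypothetical helper (not in the attachments): replace the sample locator
# IPs from the report with the addresses of your own host1 and host2.
# Usage: set_locator_ips HOST1_IP HOST2_IP FILE...
set_locator_ips() {
    h1=$1
    h2=$2
    shift 2
    for f in "$@"; do
        tmp="$f.tmp.$$"
        # Go through placeholders first, so that a new address equal to one
        # of the sample addresses is not rewritten a second time.
        sed -e 's/10\.50\.3\.38/@HOST1@/g' \
            -e 's/10\.50\.3\.14/@HOST2@/g' \
            -e "s/@HOST1@/$h1/g" \
            -e "s/@HOST2@/$h2/g" "$f" > "$tmp" &&
            cat "$tmp" > "$f"   # write back in place to keep the exec bit
        rm -f "$tmp"
    done
}
```

For example, "set_locator_ips 192.0.2.10 192.0.2.11 start-locator.sh start-server.sh" run in the unzip folder on each host would update both scripts (the two addresses here are placeholders).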

Test running:
---------------
1) rm -rf {locator0,server1}
2) Run ./start-locator.sh; ./start-server.sh on host1, then on host2. Check that this cluster start is successful.
3) Kill the locator and server processes, first on host1, then on host2:
kill [locator-PID] [server-PID]
4) Run ./start-locator.sh; ./start-server.sh on host1, then on host2. Make sure the interval between running this command on the two hosts is less than 1 second!
5) Check via gfsh that there are actually two clusters instead of one: "host1-locator" and "host1-server, host2-locator, host2-server". Sometimes "host1-server" is missing as well, because it shut down with the error
"Region /_ConfigurationRegion has potentially stale data. It is waiting for another member to recover the latest data.".
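
The sub-second interval in step 4) is hard to hit by hand. One way to get it is to launch both hosts from a third machine, as in the sketch below. Everything in it other than the two script names is an assumption: the variables HOST1, HOST2, GEODE_DIR and REMOTE, the function names, and passwordless ssh access to both hosts. It also assumes that the start scripts return once the processes are launched.

```shell
#!/bin/sh
# Assumed settings (not from the report); override them in the environment.
HOST1=${HOST1:-host1}
HOST2=${HOST2:-host2}
GEODE_DIR=${GEODE_DIR:-/path/to/geode}   # folder where the zips were unpacked
REMOTE=${REMOTE:-ssh}                    # set REMOTE=echo for a dry run

# Start the locator and then the server on one host.
start_node() {
    $REMOTE "$1" "cd $GEODE_DIR && { ./start-locator.sh; ./start-server.sh; }"
}

# Launch host1 first and host2 immediately after, without waiting for host1,
# so that both starts begin well within one second of each other.
restart_both() {
    start_node "$HOST1" &
    start_node "$HOST2" &
    wait
}
```

Calling restart_both after step 3) starts both hosts back to back. With REMOTE=echo the commands are only printed, which is a cheap way to check them before touching the cluster.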

Attachments

readme.txt                    29/May/17 10:54    1 kB      Anton Mironenko
geode-host2.zip               29/May/17 10:54    2 kB      Anton Mironenko
geode-host1.zip               29/May/17 10:54    2 kB      Anton Mironenko
20170608-host2-locator0.zip   08/Jun/17 12:47    32 kB     Anton Mironenko
20170608-host1-locator0.zip   08/Jun/17 12:47    32 kB     Anton Mironenko
20170522-geode-vyazma.zip     29/May/17 10:54    239 kB    Anton Mironenko
20170522-geode-klyazma.zip    29/May/17 10:54    219 kB    Anton Mironenko

Issue Links

incorporates: GEODE-3052 Restarting 2 locators within 1s of each other causes potential locator split brain (Closed)
relates to: GEODE-2238 Member may fail to receive cluster configuration from locator (Closed)

People

Assignee: Ken Howe
Reporter: Anton Mironenko
Votes: 0
Watchers: 5

Dates

Created: 29/May/17 10:50
Updated: 09/Apr/18 22:37
Resolved: 04/Apr/18 21:55