Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
In a test cluster using replicated persistent Regions, all of the servers were shut down and restarted. The restart hung, and the logs showed a cycle in disk store dependencies.
[info 2020/05/29 03:02:36.635 PDT <Thread-18> tid=0x8f] Region /Region_14 has potentially stale data. It is waiting for another online member to recover the latest data.
My persistent id:
  DiskStore ID: a175354a-d27d-4575-9916-16fd7ff7ea67
  Name: persistgemfire4_host1_4194
  Location: /10.32.110.100:/var/vcap/data/rundir/concRecoverAllV4O41/concRecoverAll-0529-024642/vm_5_persist4_disk_1
Members with potentially new data:
[
  DiskStore ID: 2d77752e-507d-4425-a382-a5856c61938f
  Name: persistgemfire10_host1_4208
  Location: /10.32.110.100:/var/vcap/data/rundir/concRecoverAllV4O41/concRecoverAll-0529-024642/vm_2_persist10_disk_1
]
Use the gfsh show missing-disk-stores command to see all disk stores that are being waited on by other members.
After looking at the logs for all members, the "members with potentially new data" for each member were found to be:
Member | Members with potentially new data
-------+----------------------------------
1      | all
2      | 4
3      | 4
4      | 10
5      | 2, 3, 4, 8, 10
6      | 2, 3, 4, 5, 7, 8, 10
7      | 3, 4, 10
8      | 3, 4, 10
9      | 2, 3, 4, 5, 7, 8, 10
10     | 3
It appears that there is a cycle in this "waiting for another online member" graph: 3 > 4 > 10 > 3 (member 3 waits for 4, 4 waits for 10, and 10 waits for 3).
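The cycle can be confirmed mechanically from the table above. The following is only an illustrative sketch, not Geode code: the WAITING_ON mapping is transcribed by hand from the table, the find_cycle helper is a hypothetical name, and member 1's "all" is assumed to mean every other member.

```python
# Wait-for graph transcribed from the table: member -> members it is waiting on.
# Assumption: "all" for member 1 means every other member (2 through 10).
WAITING_ON = {
    1: [2, 3, 4, 5, 6, 7, 8, 9, 10],
    2: [4],
    3: [4],
    4: [10],
    5: [2, 3, 4, 8, 10],
    6: [2, 3, 4, 5, 7, 8, 10],
    7: [3, 4, 10],
    8: [3, 4, 10],
    9: [2, 3, 4, 5, 7, 8, 10],
    10: [3],
}


def find_cycle(graph, start):
    """Depth-first search from start; return the first cycle found, or None."""
    path = []        # nodes on the current DFS path, in order
    on_path = set()  # same nodes, for O(1) membership checks
    done = set()     # nodes fully explored without finding a cycle

    def visit(node):
        if node in on_path:
            # Back edge: the cycle is the path from the first visit of node.
            return path[path.index(node):] + [node]
        if node in done:
            return None
        path.append(node)
        on_path.add(node)
        for nxt in graph.get(node, []):
            cycle = visit(nxt)
            if cycle:
                return cycle
        path.pop()
        on_path.remove(node)
        done.add(node)
        return None

    return visit(start)


print(find_cycle(WAITING_ON, 3))  # [3, 4, 10, 3]
```

Because members 3, 4, and 10 each wait only on another member of the same cycle, none of them can proceed, and every other member waits on at least one of them.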
The problem seems to have cropped up after the fix for GEODE-7196 was merged. That fix changed the timing of member-departed notifications such that a server might close a Region's Persistence Advisor before being notified that another server was shutting down. We used to send this notification upon receipt of a ShutdownMessage, but now we send it only when the membership view has changed.
Issue Links
- relates to GEODE-8477 getMembersNotShuttingDown() doesn't use consistent set of shutdown members (Open)