GEODE-8385

hang recovering from disk with cyclic dependencies


      Description

      In a test cluster using replicated persistent Regions, all of the servers were shut down and restarted. The restart hung, and the logs showed a cycle in disk store dependencies.

       

      [info 2020/05/29 03:02:36.635 PDT <Thread-18> tid=0x8f] Region /Region_14 has potentially stale data. It is waiting for another online member to recover the latest data.
      My persistent id:
        DiskStore ID: a175354a-d27d-4575-9916-16fd7ff7ea67  Name: persistgemfire4_host1_4194  Location: /10.32.110.100:/var/vcap/data/rundir/concRecoverAllV4O41/concRecoverAll-0529-024642/vm_5_persist4_disk_1
      Members with potentially new data:
      [
        DiskStore ID: 2d77752e-507d-4425-a382-a5856c61938f  Name: persistgemfire10_host1_4208  Location: /10.32.110.100:/var/vcap/data/rundir/concRecoverAllV4O41/concRecoverAll-0529-024642/vm_2_persist10_disk_1
      ]
      Use the gfsh show missing-disk-stores command to see all disk stores that are being waited on by other members.
      

      After looking at the logs for all members, the "members with potentially new data" for each member were found to be:

      Member | Members with potentially new data
      --------+----------------------------------
      1 | all
      2 | 4
      3 | 4
      4 | 10
      5 | 2, 3, 4, 8, 10
      6 | 2, 3, 4, 5, 7, 8, 10
      7 | 3, 4, 10
      8 | 3, 4, 10
      9 | 2, 3, 4, 5, 7, 8, 10
      10 | 3
      

      There appears to be a cycle in this "waiting for another online member" graph: 3 > 4 > 10 > 3. Member 3 is waiting for 4, 4 is waiting for 10, and 10 is waiting for 3, so none of them can proceed.
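
The cycle can be confirmed mechanically from the table. The following is a minimal Python sketch, not Geode code: the names WAITS_FOR and find_cycle are hypothetical, and it assumes that "all" for member 1 means every other member (2 through 10). It runs a depth-first search over the wait-for graph and reports the first cycle it finds.

```python
# Wait-for graph transcribed from the table: member -> members it is waiting on.
# Assumption: "all" for member 1 means every other member (2 through 10).
WAITS_FOR = {
    1: [2, 3, 4, 5, 6, 7, 8, 9, 10],
    2: [4],
    3: [4],
    4: [10],
    5: [2, 3, 4, 8, 10],
    6: [2, 3, 4, 5, 7, 8, 10],
    7: [3, 4, 10],
    8: [3, 4, 10],
    9: [2, 3, 4, 5, 7, 8, 10],
    10: [3],
}


def find_cycle(graph):
    """Return one cycle as a list of nodes (first node repeated last), or None."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # Back edge: the cycle is the part of the path starting at nxt.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        color[node] = BLACK
        path.pop()
        return None

    for start in graph:
        if color[start] == WHITE:
            found = visit(start)
            if found:
                return found
    return None


cycle = find_cycle(WAITS_FOR)
print(" > ".join(str(member) for member in cycle))
```

With the data above, the reported cycle consists of members 3, 4 and 10. The starting point of the printed cycle depends on traversal order, so it may appear as a rotation of 3 > 4 > 10 > 3.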

      The problem seems to have cropped up after the fix for GEODE-7196 was merged. That fix changed the timing of member-departed notifications such that a server might close a Region's Persistence Advisor before being notified that another server was shutting down. We used to issue this notification upon receipt of a ShutdownMessage, but now we issue it only when the membership view has changed.

            People

            • Assignee: Bruce J Schuchardt (bschuchardt)
            • Reporter: Bruce J Schuchardt (bschuchardt)
