GEODE-3003

Geode doesn't start after cluster restart when using cluster-configuration

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.0
    • Component/s: configuration, membership
    • Labels: None

    Description

      A two-host Geode cluster runs one locator and one server on each host.
      The first start of all nodes goes well.
      Then all nodes are stopped gracefully (kill [locator-PID] [server-PID]).
      The second start goes wrong: the locator on the first host never joins the rest of the cluster and reports this error in the locator log:
      "Region /_ConfigurationRegion has potentially stale data. It is waiting for another member to recover the latest data."
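
      The graceful stop above relies on plain kill, which sends SIGTERM by default. When scripting the restart it can help to wait until the processes have really exited before starting them again. A minimal sketch; the function name stop_and_wait and its timeout argument are hypothetical and not part of the attached scripts:

```shell
#!/bin/sh
# Hypothetical helper: send SIGTERM (the default signal of kill) to each PID,
# then poll until every process has exited or the timeout (seconds) expires.
# Usage: stop_and_wait <timeout-seconds> <pid>...
stop_and_wait() {
    timeout=$1; shift
    kill "$@" 2>/dev/null
    while [ "$timeout" -gt 0 ]; do
        alive=0
        for pid in "$@"; do
            # kill -0 sends no signal; it only tests whether the PID still exists.
            kill -0 "$pid" 2>/dev/null && alive=1
        done
        [ "$alive" -eq 0 ] && return 0
        sleep 1
        timeout=$((timeout - 1))
    done
    return 1   # at least one process was still running at the last check
}
```

      For example, stop_and_wait 30 [locator-PID] [server-PID] returns 0 once both processes are gone, or 1 if one of them was still running when the timeout ran out.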

      Sometimes (about once in 5 starts) a server also shuts down right after start with the error
      "org.apache.geode.GemFireConfigException: cluster configuration service not available".

      This bug started appearing only after we moved to Geode 1.1.1, and it completely blocks us.
      On GemFire 8.2.1 there was no such bug.

      This is very easy to reproduce.

      Test preparation:
      ---------------------
      Two zip files are attached: "geode-host1.zip" and "geode-host2.zip".
      1) unzip "geode-host1.zip" into some folder on your first host
      2) in start-locator.sh, change the locator IPs to those of your host1 and host2:
      "--locators=10.50.3.38[20236],10.50.3.14[20236]"
      3) in start-server.sh, change the locator IPs to those of your host1 and host2:
      "locators=10.50.3.38[20236],10.50.3.14[20236]"
      4) repeat steps 1)-3) on host2 with "geode-host2.zip"; the folder you unzip into should have the same path as on the first host
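
      Steps 2) and 3) can be scripted. A minimal sketch, assuming both scripts contain the locator list exactly as quoted above (sample addresses 10.50.3.38 and 10.50.3.14, port 20236); the function name set_locator_ips and its arguments are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: replace the sample locator list in both start scripts.
# Usage: set_locator_ips <unzip-dir> <host1-ip> <host2-ip>
set_locator_ips() {
    dir=$1; host1=$2; host2=$3
    for f in "$dir/start-locator.sh" "$dir/start-server.sh"; do
        # Keep a backup, then write the edited text back into the original
        # file so that its permissions (e.g. the execute bit) are preserved.
        cp "$f" "$f.bak" || return 1
        # The whole list is matched in one expression; dots and brackets
        # are escaped so they match literally.
        sed "s/10\.50\.3\.38\[20236\],10\.50\.3\.14\[20236\]/${host1}[20236],${host2}[20236]/" \
            "$f.bak" > "$f" || return 1
    done
}
```

      Run it once per host with the same two addresses, for example set_locator_ips . [host1-IP] [host2-IP] from the folder the zip file was unpacked into.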

      Test running:
      ---------------
      1) rm -rf {locator0,server1}

      2) run ./start-locator.sh; ./start-server.sh on host1, then on host2. Check that this cluster start is successful.
      3) kill the locator and server processes, first on host1, then on host2:
      kill [locator-PID] [server-PID]
      4) run
      ./start-locator.sh; ./start-server.sh
      on host1, then on host2. Make sure the interval between running this command on the two hosts is less than 1 second!
      5) check via gfsh that there are actually two clusters, "host1-locator" and "host1-server, host2-locator, host2-server", instead of one. Sometimes "host1-server" is missing as well, because it shut down with the error
      "Region /_ConfigurationRegion has potentially stale data. It is waiting for another member to recover the latest data."
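
      To check a run quickly, the logs can be scanned for the two messages quoted in this report. A minimal sketch; the function name scan_geode_logs is hypothetical, and the log file paths are whatever your start scripts produce (they are passed in as arguments, not assumed):

```shell
#!/bin/sh
# Hypothetical helper: report which of the two errors quoted in this
# report appear in the given log files.
# Usage: scan_geode_logs <logfile>...
scan_geode_logs() {
    stale='Region /_ConfigurationRegion has potentially stale data'
    noconf='cluster configuration service not available'
    for log in "$@"; do
        # -F treats the message as a fixed string; -q only sets the exit status.
        if grep -qF "$stale" "$log"; then
            echo "$log: stale-data wait on /_ConfigurationRegion"
        fi
        if grep -qF "$noconf" "$log"; then
            echo "$log: cluster configuration service not available"
        fi
    done
}
```

      A log file that contains neither message produces no output, so an empty result after step 4) suggests that the restart did not hit either error.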

      Attachments

        1. readme.txt
          1 kB
          Anton Mironenko
        2. geode-host2.zip
          2 kB
          Anton Mironenko
        3. geode-host1.zip
          2 kB
          Anton Mironenko
        4. 20170608-host2-locator0.zip
          32 kB
          Anton Mironenko
        5. 20170608-host1-locator0.zip
          32 kB
          Anton Mironenko
        6. 20170522-geode-vyazma.zip
          239 kB
          Anton Mironenko
        7. 20170522-geode-klyazma.zip
          219 kB
          Anton Mironenko


            People

              Assignee: Ken Howe (khowe)
              Reporter: Anton Mironenko (Neighbour)
              Votes: 0
              Watchers: 5
