  1. Solr
  2. SOLR-16506

Flag exception during startup if replica node name does not match zk info

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Won't Fix
    • Affects Version/s: 9.1
    • Fix Version/s: None
    • Component/s: SolrCloud
    • Labels: None

      Description

      We have a scenario in which two nodes (n1, n2) both have a directory for the same core name under the Solr data folder (`solr_data`). Both directories contain `core.properties`, but ONLY n1 has the core's data folder. In the state.json for the collection, this core/replica has `node_name` and `base_url` pointing at n1.

      Therefore n1 is the real node hosting the replica. We are not quite sure how we got into this state; it could be from a migration failure. We call the replica on n2 the "ghost replica".

      Now if we restart n2, it actually takes over the replica and even deletes the data from n1:

      1. `CoreContainer#load` calls `CorePropertiesLocator`, which finds all the cores hosted on this node by walking through the Solr data directory. It finds the ghost core and creates a `CoreDescriptor` for it.
      2. `CoreContainer#createFromDescriptor` is invoked to create a `SolrCore` out of the `CoreDescriptor`.
      3. `ZkController#preRegister` is called for that `CoreDescriptor` and publishes the replica state as `DOWN`. Note that `isPublishAsDownOnStartup` should usually return `false`, but in this case it returns `true` because `replica.getNodeName().equals(getNodeName())` is `false`.
      4. During `ZkController#publish`, state.json is published with the incorrect `base_url` and `node_name` (n2).
      5. Once state.json is updated with the incorrect values, the update triggers `UnloadCoreOnDeleteWatcher`, which unloads/deletes the core. It also later publishes `DELETECORE` to remove the core from zk.
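
The sequence above can be illustrated with a small toy model. This is only a sketch and not Solr code: `GhostReplicaTakeover`, `ReplicaEntry`, `preRegister`, and `shouldUnload` are hypothetical stand-ins, state.json is reduced to an in-memory map, and the unload rule assumes the watcher on the original host treats an entry that no longer names that host as removed.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the startup sequence described above; hypothetical, not Solr code. */
final class GhostReplicaTakeover {

  /** Simplified stand-in for one replica entry in state.json. */
  static final class ReplicaEntry {
    final String nodeName;
    final String baseUrl;
    final String state;

    ReplicaEntry(String nodeName, String baseUrl, String state) {
      this.nodeName = nodeName;
      this.baseUrl = baseUrl;
      this.state = state;
    }
  }

  /** Mirrors the condition quoted in step 3: true when the recorded node is not the local node. */
  static boolean publishAsDownOnStartup(ReplicaEntry recorded, String localNodeName) {
    return !recorded.nodeName.equals(localNodeName);
  }

  /** Steps 3 and 4: the starting node publishes DOWN and overwrites node_name and base_url. */
  static void preRegister(Map<String, ReplicaEntry> stateJson, String coreNodeName,
                          String localNodeName, String localBaseUrl) {
    ReplicaEntry recorded = stateJson.get(coreNodeName);
    if (recorded != null && publishAsDownOnStartup(recorded, localNodeName)) {
      stateJson.put(coreNodeName, new ReplicaEntry(localNodeName, localBaseUrl, "down"));
    }
  }

  /** Step 5 (assumed rule): a host unloads its core once state.json no longer names that host. */
  static boolean shouldUnload(Map<String, ReplicaEntry> stateJson, String coreNodeName,
                              String localNodeName) {
    ReplicaEntry entry = stateJson.get(coreNodeName);
    return entry == null || !entry.nodeName.equals(localNodeName);
  }

  /** Builds the initial state: the replica is recorded as active on the given node. */
  static Map<String, ReplicaEntry> initialState(String coreNodeName, String nodeName, String baseUrl) {
    Map<String, ReplicaEntry> stateJson = new HashMap<>();
    stateJson.put(coreNodeName, new ReplicaEntry(nodeName, baseUrl, "active"));
    return stateJson;
  }
}
```

Starting from `initialState("core_node1", "n1", "http://n1/solr")`, calling `preRegister` as n2 rewrites the entry to point at n2, after which `shouldUnload` evaluated for n1 returns `true`. In this model that is the point at which the real data would be lost.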

      Solution

      It seems rather risky to update state.json and publish the replica as down when the core does exist in state.json but under a different node name.

      Instead, in `ZkController` (`preRegister` -> `checkStateInZk`), we should interrupt core loading if the current node name differs from the value in ZooKeeper's state.json. That way the node does not attempt to publish DOWN for the replica and overwrite state.json with what is possibly the wrong node name.
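
As a rough illustration of the proposed guard, the check could look something like the sketch below. It is not the actual patch: `NodeNameGuard`, `NodeNameMismatchException`, and `checkNodeNameMatches` are hypothetical names, and the node name recorded in state.json is assumed to be passed in as a plain string (`null` when the replica is not listed at all).

```java
/** Hypothetical sketch of a startup guard against node name mismatches; not Solr code. */
final class NodeNameGuard {

  /** Thrown to abort core loading instead of publishing DOWN with the local node name. */
  static final class NodeNameMismatchException extends RuntimeException {
    NodeNameMismatchException(String message) {
      super(message);
    }
  }

  /**
   * Fails fast when state.json already lists the replica under a different node.
   *
   * @param coreName         name of the core being loaded
   * @param recordedNodeName node_name from state.json, or null if the replica is not listed
   * @param localNodeName    node name of the node that is starting up
   */
  static void checkNodeNameMatches(String coreName, String recordedNodeName, String localNodeName) {
    if (recordedNodeName != null && !recordedNodeName.equals(localNodeName)) {
      throw new NodeNameMismatchException(
          "Core " + coreName + " is registered on node " + recordedNodeName
              + " in state.json, but this node is " + localNodeName);
    }
  }
}
```

With a guard like this, loading the ghost core on n2 would throw before anything is published, so the entry in state.json would keep pointing at n1. A replica that is missing from state.json is deliberately let through here, since the description only concerns cores that exist in state.json under a different node name.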

       

      Remarks

      With the proposed change, Solr will no longer "auto-correct" state.json on startup if there is a node name mismatch; not sure whether that is desirable behavior, though. Some changes were made to the unit tests so that a test restart does not change the port number (i.e. the node name).

      Would love to get some input here!

            People

              Assignee: Unassigned
              Reporter: Patson Luk (patson)
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1.5h