Solr
  1. Solr
  2. SOLR-4663

Log an error if more than one core points to the same data dir.

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 4.3, 5.0
    • Fix Version/s: 4.3, 5.0
    • Component/s: None
    • Labels:
      None

      Description

      In large multi-core setups, having mistakes whereby two or more cores point to the same data dir seems quite possible. We should at least complain very loudly in the logs if this is detected.

      Should be a very straightforward check at core discovery time.

      Is this serious enough to keep Solr from coming up at all?

      1. SOLR-4663.patch
        48 kB
        Erick Erickson
      2. SOLR-4663.patch
        46 kB
        Erick Erickson
      3. SOLR-4663.patch
        28 kB
        Erick Erickson
      4. SOLR-4663.patch
        16 kB
        Erick Erickson

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          Closed after release.

          Show
          Uwe Schindler added a comment - Closed after release.
          Hide
          Erick Erickson added a comment -

          4x: r - 1466319

          Show
          Erick Erickson added a comment - 4x: r - 1466319
          Hide
          Mark Miller added a comment -

          I'm okay with preventing that in non solrcloud mode as well.

          Show
          Mark Miller added a comment - I'm okay with preventing that in non solrcloud mode as well.
          Hide
          Erick Erickson added a comment -

          Last version, nothing really changed just cleanup.

          Mark Miller

          In CoreAdminHandler, there's a bit of code specific to reporting same-named cores iff zkAware is set. I took out the ZK-specific parts and wanted to ping you in case that was a bad move. I also changed the error returned as per the discussion with Hoss.

          trunk r: 1466291

          Show
          Erick Erickson added a comment - Last version, nothing really changed just cleanup. Mark Miller In CoreAdminHandler, there's a bit of code specific to reporting same-named cores iff zkAware is set. I took out the ZK-specific parts and wanted to ping you in case that was a bad move. I also changed the error returned as per the discussion with Hoss. trunk r: 1466291
          Hide
          Erick Erickson added a comment -

          I'll commit this later today unless there are objections, and assuming all the tests pass (running nightly now).

          Show
          Erick Erickson added a comment - I'll commit this later today unless there are objections, and assuming all the tests pass (running nightly now).
          Hide
          Erick Erickson added a comment -

          All tests run with this patch, putting up for comment.

          Still have to deal with Yonik's and Hoss's suggestions about recording core creation errors rather than hard-failing.

          But the major change here is that creating cores via core-admin should correctly persist to solr.xml.

          Persisting is an incredible pain. I'll be Sooooooo happy when we obsolete that nonsense.

          Show
          Erick Erickson added a comment - All tests run with this patch, putting up for comment. Still have to deal with Yonik's and Hoss's suggestions about recording core creation errors rather than hard-failing. But the major change here is that creating cores via core-admin should correctly persist to solr.xml. Persisting is an incredible pain. I'll be Sooooooo happy when we obsolete that nonsense.
          Hide
          Hoss Man added a comment -

          Minor Tangent...

          ...This would imply that when someone tried to actually do something with the offending cores, they'd get a "core not found" message,...

          After posting my last comment, it occured to me that even though we are tracking core init failures and reporting them in STATUS requests, we are still returning 404s when people attempt to use those cores ... i've opened SOLR-4672 to consier returning a 500 wraped around the init error.

          Show
          Hoss Man added a comment - Minor Tangent... ...This would imply that when someone tried to actually do something with the offending cores, they'd get a "core not found" message,... After posting my last comment, it occured to me that even though we are tracking core init failures and reporting them in STATUS requests, we are still returning 404s when people attempt to use those cores ... i've opened SOLR-4672 to consier returning a 500 wraped around the init error.
          Hide
          Erick Erickson added a comment -

          ~hossman Didn't know that, that sounds like it pretty much solves the issue. I'll look over the patch, but any tips on what "properly recorded" means (I have to run out now, so I'm being lazy, but I can look later.

          Erick

          Show
          Erick Erickson added a comment - ~hossman Didn't know that, that sounds like it pretty much solves the issue. I'll look over the patch, but any tips on what "properly recorded" means (I have to run out now, so I'm being lazy, but I can look later. Erick
          Hide
          Hoss Man added a comment -

          ... The fast fix to the current patch would be just to remove the offending cores from the lists of cores (both lazily loaded and load-on-startup) and log an error.

          This would imply that when someone tried to actually do something with the offending cores, they'd get a "core not found" message, which would be slightly misleading, but there'd be a big fat error (not info, not warn) in the log files so I'm not too worried. ...

          Don't forget that CoreContainer now has the ability to track and report (as part of "STATUS" requests) that cores failed to initialize (either on startup or via API calls) and why. (see SOLR-3591 and child tasks)

          This type of dataDir error should be no different: if coreA loads fine, and then coreB fails to load because it points at the same data dir as coreA that doesn't need to prevent the server from functioning, coreA should still be usable, and as long as the error is properly recorded in the CoreContainer the UI and the CoreAdminHandler will report back why coreB isn't available.

          Show
          Hoss Man added a comment - ... The fast fix to the current patch would be just to remove the offending cores from the lists of cores (both lazily loaded and load-on-startup) and log an error. This would imply that when someone tried to actually do something with the offending cores, they'd get a "core not found" message, which would be slightly misleading, but there'd be a big fat error (not info, not warn) in the log files so I'm not too worried. ... Don't forget that CoreContainer now has the ability to track and report (as part of "STATUS" requests) that cores failed to initialize (either on startup or via API calls) and why. (see SOLR-3591 and child tasks) This type of dataDir error should be no different: if coreA loads fine, and then coreB fails to load because it points at the same data dir as coreA that doesn't need to prevent the server from functioning, coreA should still be usable, and as long as the error is properly recorded in the CoreContainer the UI and the CoreAdminHandler will report back why coreB isn't available.
          Hide
          Erick Erickson added a comment -

          bq: creating a core via API

          This patch doesn't address this, at least not yet. Good point though, I'll see about adding that in.

          bq: If someone messes with a core's configuration such that it fails on startup...

          You always make my life more complicated <G>.... I can argue that either way, is this a fail-fast situation or is the robustness of driving on as well as possible worth the added difficulty in tracking down? The fast fix to the current patch would be just to remove the offending cores from the lists of cores (both lazily loaded and load-on-startup) and log an error.

          This would imply that when someone tried to actually do something with the offending cores, they'd get a "core not found" message, which would be slightly misleading, but there'd be a big fat error (not info, not warn) in the log files so I'm not too worried. That would avoid the complexity of checking every time we tried to open a core. The important bit is to fail without screwing up the index files for the offending cores (two cores pointing to the same dataDir). I expect that the current behavior for, say two cores with the same name is that we use one or the other, perhaps not consistently. It's Not Right to do anything at all IMO.

          I think in the discovery case, though, the chance of copying the same core.properties file to multiple places, thus having cores with the same name or data dir is much more likely....

          So do you actively object to failing fast? Or are you ok with failing fast but your comments are intended to make sure we're considering all the angles? Initially I didn't want to make things more complicated, but by just not putting the offending cores in the relevant lists I think the complexity argument goes away and preserving index integrity is maintained so I'm +/-0.

          Let me know...

          Show
          Erick Erickson added a comment - bq: creating a core via API This patch doesn't address this, at least not yet. Good point though, I'll see about adding that in. bq: If someone messes with a core's configuration such that it fails on startup... You always make my life more complicated <G>.... I can argue that either way, is this a fail-fast situation or is the robustness of driving on as well as possible worth the added difficulty in tracking down? The fast fix to the current patch would be just to remove the offending cores from the lists of cores (both lazily loaded and load-on-startup) and log an error. This would imply that when someone tried to actually do something with the offending cores, they'd get a "core not found" message, which would be slightly misleading, but there'd be a big fat error (not info, not warn) in the log files so I'm not too worried. That would avoid the complexity of checking every time we tried to open a core. The important bit is to fail without screwing up the index files for the offending cores (two cores pointing to the same dataDir). I expect that the current behavior for, say two cores with the same name is that we use one or the other, perhaps not consistently. It's Not Right to do anything at all IMO. I think in the discovery case, though, the chance of copying the same core.properties file to multiple places, thus having cores with the same name or data dir is much more likely.... So do you actively object to failing fast? Or are you ok with failing fast but your comments are intended to make sure we're considering all the angles? Initially I didn't want to make things more complicated, but by just not putting the offending cores in the relevant lists I think the complexity argument goes away and preserving index integrity is maintained so I'm +/-0. Let me know...
          Hide
          Yonik Seeley added a comment -

          Think about creating a core via API... pointing at another core's data directory should only only cause the create of that new core to fail.
          Likewise, if someone messes with a core's configuration such that it fails on startup, it doesn't seem like this should stop the other correctly configured cores from loading.

          Show
          Yonik Seeley added a comment - Think about creating a core via API... pointing at another core's data directory should only only cause the create of that new core to fail. Likewise, if someone messes with a core's configuration such that it fails on startup, it doesn't seem like this should stop the other correctly configured cores from loading.
          Hide
          Erick Erickson added a comment -

          Here's a patch. It fixes another issue I saw with persisting (I'll be soooo glad when that's obsoleted!).

          It fails hard when either a core with a duplicate name is found or with a duplicate data dir, both with old and new style solr.xml. It's still possible for more than one core to share the same instance dir. I'd guess there are other checks that could be done, but this is a start.

          My real question is whether it's appropriate to fail hard on startup or just log a very loud warning. Ryan's the only person who's weighed in on this so far. My preference is to fail hard but I'm not particularly invested either way.

          I'll be checking this in tomorrow at the latest unless there are objections.

          Show
          Erick Erickson added a comment - Here's a patch. It fixes another issue I saw with persisting (I'll be soooo glad when that's obsoleted!). It fails hard when either a core with a duplicate name is found or with a duplicate data dir, both with old and new style solr.xml. It's still possible for more than one core to share the same instance dir. I'd guess there are other checks that could be done, but this is a start. My real question is whether it's appropriate to fail hard on startup or just log a very loud warning. Ryan's the only person who's weighed in on this so far. My preference is to fail hard but I'm not particularly invested either way. I'll be checking this in tomorrow at the latest unless there are objections.
          Hide
          Ryan Ernst added a comment -

          +1 to failing to startup in this serious misconfiguration.

          Show
          Ryan Ernst added a comment - +1 to failing to startup in this serious misconfiguration.

            People

            • Assignee:
              Erick Erickson
              Reporter:
              Erick Erickson
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development