Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-4663

Log an error if more than one core points to the same data dir.

Details

    • Improvement
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • 4.3, 6.0
    • 4.3, 6.0
    • None
    • None

    Description

      In large multi-core setups, having mistakes whereby two or more cores point to the same data dir seems quite possible. We should at least complain very loudly in the logs if this is detected.

      Should be a very straightforward check at core discovery time.

      Is this serious enough to keep Solr from coming up at all?

      Attachments

        1. SOLR-4663.patch
          48 kB
          Erick Erickson
        2. SOLR-4663.patch
          46 kB
          Erick Erickson
        3. SOLR-4663.patch
          28 kB
          Erick Erickson
        4. SOLR-4663.patch
          16 kB
          Erick Erickson

        Issue Links

          Activity

            rjernst Ryan Ernst added a comment -

            +1 to failing to startup in this serious misconfiguration.

            rjernst Ryan Ernst added a comment - +1 to failing to startup in this serious misconfiguration.
            erickerickson Erick Erickson added a comment -

            Here's a patch. It fixes another issue I saw with persisting (I'll be soooo glad when that's obsoleted!).

            It fails hard when either a core with a duplicate name is found or with a duplicate data dir, both with old and new style solr.xml. It's still possible for more than one core to share the same instance dir. I'd guess there are other checks that could be done, but this is a start.

            My real question is whether it's appropriate to fail hard on startup or just log a very loud warning. Ryan's the only person who's weighed in on this so far. My preference is to fail hard but I'm not particularly invested either way.

            I'll be checking this in tomorrow at the latest unless there are objections.

            erickerickson Erick Erickson added a comment - Here's a patch. It fixes another issue I saw with persisting (I'll be soooo glad when that's obsoleted!). It fails hard when either a core with a duplicate name is found or with a duplicate data dir, both with old and new style solr.xml. It's still possible for more than one core to share the same instance dir. I'd guess there are other checks that could be done, but this is a start. My real question is whether it's appropriate to fail hard on startup or just log a very loud warning. Ryan's the only person who's weighed in on this so far. My preference is to fail hard but I'm not particularly invested either way. I'll be checking this in tomorrow at the latest unless there are objections.
            yseeley@gmail.com Yonik Seeley added a comment -

            Think about creating a core via API... pointing at another core's data directory should only only cause the create of that new core to fail.
            Likewise, if someone messes with a core's configuration such that it fails on startup, it doesn't seem like this should stop the other correctly configured cores from loading.

            yseeley@gmail.com Yonik Seeley added a comment - Think about creating a core via API... pointing at another core's data directory should only only cause the create of that new core to fail. Likewise, if someone messes with a core's configuration such that it fails on startup, it doesn't seem like this should stop the other correctly configured cores from loading.
            erickerickson Erick Erickson added a comment -

            bq: creating a core via API

            This patch doesn't address this, at least not yet. Good point though, I'll see about adding that in.

            bq: If someone messes with a core's configuration such that it fails on startup...

            You always make my life more complicated <G>.... I can argue that either way, is this a fail-fast situation or is the robustness of driving on as well as possible worth the added difficulty in tracking down? The fast fix to the current patch would be just to remove the offending cores from the lists of cores (both lazily loaded and load-on-startup) and log an error.

            This would imply that when someone tried to actually do something with the offending cores, they'd get a "core not found" message, which would be slightly misleading, but there'd be a big fat error (not info, not warn) in the log files so I'm not too worried. That would avoid the complexity of checking every time we tried to open a core. The important bit is to fail without screwing up the index files for the offending cores (two cores pointing to the same dataDir). I expect that the current behavior for, say two cores with the same name is that we use one or the other, perhaps not consistently. It's Not Right to do anything at all IMO.

            I think in the discovery case, though, the chance of copying the same core.properties file to multiple places, thus having cores with the same name or data dir is much more likely....

            So do you actively object to failing fast? Or are you ok with failing fast but your comments are intended to make sure we're considering all the angles? Initially I didn't want to make things more complicated, but by just not putting the offending cores in the relevant lists I think the complexity argument goes away and preserving index integrity is maintained so I'm +/-0.

            Let me know...

            erickerickson Erick Erickson added a comment - bq: creating a core via API This patch doesn't address this, at least not yet. Good point though, I'll see about adding that in. bq: If someone messes with a core's configuration such that it fails on startup... You always make my life more complicated <G>.... I can argue that either way, is this a fail-fast situation or is the robustness of driving on as well as possible worth the added difficulty in tracking down? The fast fix to the current patch would be just to remove the offending cores from the lists of cores (both lazily loaded and load-on-startup) and log an error. This would imply that when someone tried to actually do something with the offending cores, they'd get a "core not found" message, which would be slightly misleading, but there'd be a big fat error (not info, not warn) in the log files so I'm not too worried. That would avoid the complexity of checking every time we tried to open a core. The important bit is to fail without screwing up the index files for the offending cores (two cores pointing to the same dataDir). I expect that the current behavior for, say two cores with the same name is that we use one or the other, perhaps not consistently. It's Not Right to do anything at all IMO. I think in the discovery case, though, the chance of copying the same core.properties file to multiple places, thus having cores with the same name or data dir is much more likely.... So do you actively object to failing fast? Or are you ok with failing fast but your comments are intended to make sure we're considering all the angles? Initially I didn't want to make things more complicated, but by just not putting the offending cores in the relevant lists I think the complexity argument goes away and preserving index integrity is maintained so I'm +/-0. Let me know...

            ... The fast fix to the current patch would be just to remove the offending cores from the lists of cores (both lazily loaded and load-on-startup) and log an error.

            This would imply that when someone tried to actually do something with the offending cores, they'd get a "core not found" message, which would be slightly misleading, but there'd be a big fat error (not info, not warn) in the log files so I'm not too worried. ...

            Don't forget that CoreContainer now has the ability to track and report (as part of "STATUS" requests) that cores failed to initialize (either on startup or via API calls) and why. (see SOLR-3591 and child tasks)

            This type of dataDir error should be no different: if coreA loads fine, and then coreB fails to load because it points at the same data dir as coreA that doesn't need to prevent the server from functioning, coreA should still be usable, and as long as the error is properly recorded in the CoreContainer the UI and the CoreAdminHandler will report back why coreB isn't available.

            hossman Chris M. Hostetter added a comment - ... The fast fix to the current patch would be just to remove the offending cores from the lists of cores (both lazily loaded and load-on-startup) and log an error. This would imply that when someone tried to actually do something with the offending cores, they'd get a "core not found" message, which would be slightly misleading, but there'd be a big fat error (not info, not warn) in the log files so I'm not too worried. ... Don't forget that CoreContainer now has the ability to track and report (as part of "STATUS" requests) that cores failed to initialize (either on startup or via API calls) and why. (see SOLR-3591 and child tasks) This type of dataDir error should be no different: if coreA loads fine, and then coreB fails to load because it points at the same data dir as coreA that doesn't need to prevent the server from functioning, coreA should still be usable, and as long as the error is properly recorded in the CoreContainer the UI and the CoreAdminHandler will report back why coreB isn't available.
            erickerickson Erick Erickson added a comment -

            ~hossman Didn't know that, that sounds like it pretty much solves the issue. I'll look over the patch, but any tips on what "properly recorded" means (I have to run out now, so I'm being lazy, but I can look later.

            Erick

            erickerickson Erick Erickson added a comment - ~hossman Didn't know that, that sounds like it pretty much solves the issue. I'll look over the patch, but any tips on what "properly recorded" means (I have to run out now, so I'm being lazy, but I can look later. Erick

            Minor Tangent...

            ...This would imply that when someone tried to actually do something with the offending cores, they'd get a "core not found" message,...

            After posting my last comment, it occured to me that even though we are tracking core init failures and reporting them in STATUS requests, we are still returning 404s when people attempt to use those cores ... i've opened SOLR-4672 to consier returning a 500 wraped around the init error.

            hossman Chris M. Hostetter added a comment - Minor Tangent... ...This would imply that when someone tried to actually do something with the offending cores, they'd get a "core not found" message,... After posting my last comment, it occured to me that even though we are tracking core init failures and reporting them in STATUS requests, we are still returning 404s when people attempt to use those cores ... i've opened SOLR-4672 to consier returning a 500 wraped around the init error.
            erickerickson Erick Erickson added a comment -

            All tests run with this patch, putting up for comment.

            Still have to deal with Yonik's and Hoss's suggestions about recording core creation errors rather than hard-failing.

            But the major change here is that creating cores via core-admin should correctly persist to solr.xml.

            Persisting is an incredible pain. I'll be Sooooooo happy when we obsolete that nonsense.

            erickerickson Erick Erickson added a comment - All tests run with this patch, putting up for comment. Still have to deal with Yonik's and Hoss's suggestions about recording core creation errors rather than hard-failing. But the major change here is that creating cores via core-admin should correctly persist to solr.xml. Persisting is an incredible pain. I'll be Sooooooo happy when we obsolete that nonsense.
            erickerickson Erick Erickson added a comment -

            I'll commit this later today unless there are objections, and assuming all the tests pass (running nightly now).

            erickerickson Erick Erickson added a comment - I'll commit this later today unless there are objections, and assuming all the tests pass (running nightly now).
            erickerickson Erick Erickson added a comment -

            Last version, nothing really changed just cleanup.

            markrmiller@gmail.com

            In CoreAdminHandler, there's a bit of code specific to reporting same-named cores iff zkAware is set. I took out the ZK-specific parts and wanted to ping you in case that was a bad move. I also changed the error returned as per the discussion with Hoss.

            trunk r: 1466291

            erickerickson Erick Erickson added a comment - Last version, nothing really changed just cleanup. markrmiller@gmail.com In CoreAdminHandler, there's a bit of code specific to reporting same-named cores iff zkAware is set. I took out the ZK-specific parts and wanted to ping you in case that was a bad move. I also changed the error returned as per the discussion with Hoss. trunk r: 1466291
            markrmiller@gmail.com Mark Miller added a comment -

            I'm okay with preventing that in non solrcloud mode as well.

            markrmiller@gmail.com Mark Miller added a comment - I'm okay with preventing that in non solrcloud mode as well.
            erickerickson Erick Erickson added a comment -

            4x: r - 1466319

            erickerickson Erick Erickson added a comment - 4x: r - 1466319
            uschindler Uwe Schindler added a comment -

            Closed after release.

            uschindler Uwe Schindler added a comment - Closed after release.

            People

              erickerickson Erick Erickson
              erickerickson Erick Erickson
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: