[SOLR-4663] Log an error if more than one core points to the same data dir. - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 4.3, 6.0
Fix Version/s: 4.3, 6.0
Component/s: None
Labels: None

Description

In large multi-core setups, having mistakes whereby two or more cores point to the same data dir seems quite possible. We should at least complain very loudly in the logs if this is detected.

Should be a very straightforward check at core discovery time.

Is this serious enough to keep Solr from coming up at all?

Attachments

SOLR-4663.patch
09/Apr/13 23:48
48 kB
Erick Erickson
SOLR-4663.patch
09/Apr/13 14:38
46 kB
Erick Erickson
SOLR-4663.patch
04/Apr/13 22:47
28 kB
Erick Erickson
SOLR-4663.patch
04/Apr/13 14:14
16 kB
Erick Erickson

Issue Links

incorporates

SOLR-1905 Multicore admin/cores?action=CREATE

Closed

SOLR-4347 Insure that newly-created cores via Admin handler are persisted in solr.xml

Closed

Activity

Ryan Ernst added a comment - 03/Apr/13 01:25

+1 to failing to startup in this serious misconfiguration.

Erick Erickson added a comment - 04/Apr/13 14:14

Here's a patch. It fixes another issue I saw with persisting (I'll be soooo glad when that's obsoleted!).

It fails hard when either a core with a duplicate name is found or with a duplicate data dir, both with old and new style solr.xml. It's still possible for more than one core to share the same instance dir. I'd guess there are other checks that could be done, but this is a start.

My real question is whether it's appropriate to fail hard on startup or just log a very loud warning. Ryan's the only person who's weighed in on this so far. My preference is to fail hard but I'm not particularly invested either way.

I'll be checking this in tomorrow at the latest unless there are objections.

Yonik Seeley added a comment - 04/Apr/13 14:22

Think about creating a core via API... pointing at another core's data directory should only cause the creation of that new core to fail.
Likewise, if someone messes with a core's configuration such that it fails on startup, it doesn't seem like this should stop the other correctly configured cores from loading.

Erick Erickson added a comment - 04/Apr/13 15:07

bq: creating a core via API

This patch doesn't address this, at least not yet. Good point though, I'll see about adding that in.

bq: If someone messes with a core's configuration such that it fails on startup...

You always make my life more complicated <G>.... I can argue that either way, is this a fail-fast situation or is the robustness of driving on as well as possible worth the added difficulty in tracking down? The fast fix to the current patch would be just to remove the offending cores from the lists of cores (both lazily loaded and load-on-startup) and log an error.

This would imply that when someone tried to actually do something with the offending cores, they'd get a "core not found" message, which would be slightly misleading, but there'd be a big fat error (not info, not warn) in the log files so I'm not too worried. That would avoid the complexity of checking every time we tried to open a core. The important bit is to fail without screwing up the index files for the offending cores (two cores pointing to the same dataDir). I expect that the current behavior for, say two cores with the same name is that we use one or the other, perhaps not consistently. It's Not Right to do anything at all IMO.

I think in the discovery case, though, copying the same core.properties file to multiple places, and thus ending up with cores that share a name or data dir, is much more likely....

So do you actively object to failing fast? Or are you ok with failing fast but your comments are intended to make sure we're considering all the angles? Initially I didn't want to make things more complicated, but by just not putting the offending cores in the relevant lists I think the complexity argument goes away and index integrity is preserved, so I'm +/-0.

Let me know...
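
For illustration, the duplicate check discussed above could look roughly like the sketch below. It is not the code from the attached patch: `DuplicateCoreCheck`, `CoreDef`, and `findConflicts` are hypothetical names, and the sketch assumes each discovered core has already been reduced to a name plus a data dir path. Offending cores are reported rather than loaded, and the first core to claim a name or data dir is left alone.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateCoreCheck {

    /** Hypothetical stand-in for a discovered core; not Solr's own descriptor class. */
    public static final class CoreDef {
        final String name;
        final Path dataDir;

        public CoreDef(String name, String dataDir) {
            this.name = name;
            // Normalize so that "/data/x/../a" and "/data/a" compare as equal.
            this.dataDir = Paths.get(dataDir).toAbsolutePath().normalize();
        }
    }

    /**
     * Returns one error message per offending core. The first core seen with a
     * given name or data dir is kept; later ones are reported as conflicts.
     */
    public static List<String> findConflicts(List<CoreDef> cores) {
        Map<String, CoreDef> byName = new HashMap<>();
        Map<Path, CoreDef> byDataDir = new HashMap<>();
        List<String> errors = new ArrayList<>();

        for (CoreDef core : cores) {
            if (byName.putIfAbsent(core.name, core) != null) {
                errors.add("Duplicate core name: " + core.name);
                continue; // do not let a rejected core claim a data dir
            }
            CoreDef owner = byDataDir.putIfAbsent(core.dataDir, core);
            if (owner != null) {
                errors.add("Core " + core.name + " points at dataDir " + core.dataDir
                        + ", which is already used by core " + owner.name);
            }
        }
        return errors;
    }
}
```

A caller would log each returned message at error level and skip the offending cores, which matches the "big fat error in the log" behavior described above.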

Chris M. Hostetter added a comment - 04/Apr/13 16:50

... The fast fix to the current patch would be just to remove the offending cores from the lists of cores (both lazily loaded and load-on-startup) and log an error.

This would imply that when someone tried to actually do something with the offending cores, they'd get a "core not found" message, which would be slightly misleading, but there'd be a big fat error (not info, not warn) in the log files so I'm not too worried. ...

Don't forget that CoreContainer now has the ability to track and report (as part of "STATUS" requests) that cores failed to initialize (either on startup or via API calls) and why. (see SOLR-3591 and child tasks)

This type of dataDir error should be no different: if coreA loads fine, and then coreB fails to load because it points at the same data dir as coreA that doesn't need to prevent the server from functioning, coreA should still be usable, and as long as the error is properly recorded in the CoreContainer the UI and the CoreAdminHandler will report back why coreB isn't available.
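
As a rough sketch of that behavior (not CoreContainer's actual implementation): the class below records a per-core init failure when a data dir is already claimed, while cores registered earlier stay usable. `CoreRegistry`, `register`, `isLoaded`, and `status` are hypothetical names chosen for the example.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical registry; illustrates the idea only, not Solr's CoreContainer. */
public class CoreRegistry {
    private final Map<String, String> dataDirByCore = new LinkedHashMap<>();
    private final Map<String, String> ownerByDataDir = new HashMap<>();
    private final Map<String, Exception> initFailures = new LinkedHashMap<>();

    /**
     * Tries to register a core. A data dir conflict fails only this core:
     * the failure is recorded and previously registered cores are untouched.
     */
    public boolean register(String coreName, String dataDir) {
        String owner = ownerByDataDir.get(dataDir);
        if (owner != null) {
            initFailures.put(coreName, new IllegalStateException(
                    "dataDir " + dataDir + " is already in use by core " + owner));
            return false;
        }
        ownerByDataDir.put(dataDir, coreName);
        dataDirByCore.put(coreName, dataDir);
        initFailures.remove(coreName); // a successful retry clears an old failure
        return true;
    }

    public boolean isLoaded(String coreName) {
        return dataDirByCore.containsKey(coreName);
    }

    /** Status report: "OK" for loaded cores, the recorded reason for failed ones. */
    public Map<String, String> status() {
        Map<String, String> report = new LinkedHashMap<>();
        for (String name : dataDirByCore.keySet()) {
            report.put(name, "OK");
        }
        for (Map.Entry<String, Exception> e : initFailures.entrySet()) {
            report.put(e.getKey(), "INIT FAILURE: " + e.getValue().getMessage());
        }
        return report;
    }
}
```

With this shape, registering coreA and then coreB against the same data dir leaves coreA loaded, and the status report explains why coreB is unavailable.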

Erick Erickson added a comment - 04/Apr/13 17:18

~hossman Didn't know that, that sounds like it pretty much solves the issue. I'll look over the patch, but any tips on what "properly recorded" means? (I have to run out now, so I'm being lazy, but I can look later.)

Erick

Chris M. Hostetter added a comment - 04/Apr/13 17:19

Minor Tangent...

...This would imply that when someone tried to actually do something with the offending cores, they'd get a "core not found" message,...

After posting my last comment, it occurred to me that even though we are tracking core init failures and reporting them in STATUS requests, we are still returning 404s when people attempt to use those cores ... I've opened SOLR-4672 to consider returning a 500 wrapped around the init error.

Erick Erickson added a comment - 04/Apr/13 22:47

All tests run with this patch, putting up for comment.

Still have to deal with Yonik's and Hoss's suggestions about recording core creation errors rather than hard-failing.

But the major change here is that creating cores via core-admin should correctly persist to solr.xml.

Persisting is an incredible pain. I'll be Sooooooo happy when we obsolete that nonsense.

Erick Erickson added a comment - 09/Apr/13 14:38

I'll commit this later today unless there are objections, and assuming all the tests pass (running nightly now).

Erick Erickson added a comment - 09/Apr/13 23:48

Last version; nothing really changed, just cleanup.

markrmiller@gmail.com

In CoreAdminHandler, there's a bit of code specific to reporting same-named cores iff zkAware is set. I took out the ZK-specific parts and wanted to ping you in case that was a bad move. I also changed the error returned as per the discussion with Hoss.

trunk r: 1466291

Mark Miller added a comment - 10/Apr/13 00:00

I'm okay with preventing that in non solrcloud mode as well.

Erick Erickson added a comment - 10/Apr/13 03:03

4x: r - 1466319

Uwe Schindler added a comment - 10/May/13 10:34

Closed after release.

People

Assignee: Erick Erickson

Reporter: Erick Erickson

Votes: 0

Watchers: 4

Dates

Created: 02/Apr/13 15:39

Updated: 09/May/16 18:43

Resolved: 10/Apr/13 03:03