Solr
  1. Solr
  2. SOLR-4083

Deprecate specifying individual <core> information in solr.xml. Possibly deprecate solr.xml entirely

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Won't Fix
    • Affects Version/s: 4.1, 5.0
    • Fix Version/s: None
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      Spinoff from SOLR-1306. Having a solr.xml file is limiting and possibly unnecessary. We'd gain flexibility by having an "auto-discovery", essentially walking the directories and finding all the cores and just loading them.

      Here's an issue to start the discussion of what that would look like. At this point the way I'm thinking about it depends on SOLR-1306, which depends on SOLR-1028, so the chain is getting kind of long.

      Straw-man proposal:
      1> system properties can be specified as root paths in the solr tree to start discovery.
      2> the directory walking process will stop going deep (but not wide) in the directories whenever a solrcore.properties file is encountered. That file can contain any of the properties currently specifiable in a <core> tag. This allows, for instance, re-use of a single solrconfig.xml or schema.xml file across multiple cores. I really dont want to get into having cores-within-cores. While this latter is possible, I don't see any advantage. You can have multiple roots and there's no requirement that the cores be in the directory immediately below that root they can be arbitrarily deep.
      3> I'm not quite sure what to do with the various properties in the <cores> tag. Perhaps just require these to be system properties?
      4> Notice the title. Does it still make sense to specify <3> in solr.xml but ignore the cores stuff? It seems like so little information will be in solr.xml if we take all the <core> tags out that we should just kill it all together.
      5> Not quite sure what this means for where the cores live. Is it arbitrary? Anywyere on disk? Why not?
      6> core swapping/renaming/whatever. Really, this is about how we model persist="true" on solr.xml. It's easy if we keep solr.xml and just remove the individual core entries. Where to put them?
      7> if we're supposed to persist core admin operations, it seems like we just persist this stuff to the individual solrcore.properties files. Things like whether it's loaded, whether its name has changed (1028 allows lazy loading).
      8> This still provide the capability of your own custom CoreDescriptorProvider, which you'll have to specify somehow. I'm not quite sure where yet.

      solr.xml is really the bootstrap for the whole shootin' match. Removing it entirely means we have to specify root directories, zk parameters, whatever somehow. What do people think is the best option here? Leave a degenerate solr.xml? Require system properties be set for any of these options? Currently, the options we'll need are anything (actual or proposed) in the <solr> and <cores> tags.

      So, what the first cut at this would be, building on 1306, is a default CoreDescriptorProvider that ignored all the <core> entries in solr.xml, walked the tree and loaded all the cores found. I claim this is a quick thing to PoC assuming SOLR-1306 and I'll try to provide a patch demonstrating it over the weekend.

      But mostly, this is a place to start the discussion about what this would look like rather than have it get lost in SOLR-1306.

      finally, note that I have no intention of putting any of this into 4.x at least until we cut the 4.1/4.0.1 whatever.

      And, of course, until we fully deprecate solr.xml (5.0?) the current behavior will be the default.

        Issue Links

          Activity

          Hide
          Mark Miller added a comment -

          I think the core sys properties can go.

          I think the core name should just be the folder name. It should auto discover cores in solrhome or solrhome/cores with the ability to add other locations with a sys prop.

          Some config file per core folder with any core metadata seems to make sense.

          Most top level stuff would be passed by sys prop.

          This still provide the capability of your own custom CoreDescriptorProvider, which you'll have to specify somehow.

          I'm still not convinced we need to provide a plugin point there. What are the gains for complication?

          Show
          Mark Miller added a comment - I think the core sys properties can go. I think the core name should just be the folder name. It should auto discover cores in solrhome or solrhome/cores with the ability to add other locations with a sys prop. Some config file per core folder with any core metadata seems to make sense. Most top level stuff would be passed by sys prop. This still provide the capability of your own custom CoreDescriptorProvider, which you'll have to specify somehow. I'm still not convinced we need to provide a plugin point there. What are the gains for complication?
          Hide
          Erick Erickson added a comment -

          MMMMMooooommmmmm! He keeps bugging me <G>...

          But maybe you've forced me to get to the core of the issue.

          I think the answer is that I don't know if it's acceptable to wait for startup until 10-15K directories are enumerated and any associated properties files are examined (that's the scale we're talking here). OTOH, I have no good intuition that it's not acceptable. I was originally thinking that the core information for large, complex installations could be kept in some DB somewhere, or even in a special Solr "meta-data" core, and accessed on demand since there is presumably a system-of-record for that information, possibly externally maintained. So you could have a much faster startup in this case than if you had to enumerate a (very) large tree structure.

          You're right that in a situation where the (pluggable or not) coreDescriptorProvider walked the directory structure anyway, there's no need to provide a way to plug anything, it'd take the same amount of time either way. But what about other ways of storing this?

          And I guess that assuming an infrastructure is out there somewhere, I can argue that provisioning the core (i.e. getting the directory created, getting the minimum directory structure in place etc) has to be done outside of this mechanism anyway. Once a site solves that problem, just walking the directory tree is sufficient, there's no need for additional coupling of the CoreDescriptorProvider to the physical directory structure. Get the structure right and you'd automatically have the info a CoreDescriptorProvider would return.

          And I suppose one could spin off a thread at startup so startup would actually be very fast at the expense of (possibly) waiting on your first core get until either 1> the directory tree was exhausted or 2> the core was found.

          Let me run an experiment or two to see what this looks like in practice. If it takes a minute or two to start up, especially if loading core info is a background thread, I may be (finally) forced to agree with you...

          Show
          Erick Erickson added a comment - MMMMMooooommmmmm! He keeps bugging me <G>... But maybe you've forced me to get to the core of the issue. I think the answer is that I don't know if it's acceptable to wait for startup until 10-15K directories are enumerated and any associated properties files are examined (that's the scale we're talking here). OTOH, I have no good intuition that it's not acceptable. I was originally thinking that the core information for large, complex installations could be kept in some DB somewhere, or even in a special Solr "meta-data" core, and accessed on demand since there is presumably a system-of-record for that information, possibly externally maintained. So you could have a much faster startup in this case than if you had to enumerate a (very) large tree structure. You're right that in a situation where the (pluggable or not) coreDescriptorProvider walked the directory structure anyway, there's no need to provide a way to plug anything, it'd take the same amount of time either way. But what about other ways of storing this? And I guess that assuming an infrastructure is out there somewhere, I can argue that provisioning the core (i.e. getting the directory created, getting the minimum directory structure in place etc) has to be done outside of this mechanism anyway. Once a site solves that problem, just walking the directory tree is sufficient, there's no need for additional coupling of the CoreDescriptorProvider to the physical directory structure. Get the structure right and you'd automatically have the info a CoreDescriptorProvider would return. And I suppose one could spin off a thread at startup so startup would actually be very fast at the expense of (possibly) waiting on your first core get until either 1> the directory tree was exhausted or 2> the core was found. Let me run an experiment or two to see what this looks like in practice. If it takes a minute or two to start up, especially if loading core info is a background thread, I may be (finally) forced to agree with you...
          Hide
          Mark Miller added a comment - - edited

          loading core info is a background thread

          We would use an executor, like we do now in 5x and 4x dev branches - so n cores would load at a time, but that would not need to affect the enumeration - you could queue up the load info. I have to imagine we can enumerate on a scale similar to storing this data elsewhere. It might be a bit less efficient, but I doubt to the point that it would argue for opening and maintaining a plugin point like this - I doubt it's going to be exponential. I'm also not sure why you would load this many cores at once - I thought the whole point was that a subset would be loaded and the rest on demand/lazily - so mostly you will just be enumerating very quickly.

          Some OS's do slow down with many thousands of directories in one dir - but you will have the same issue with 1000's of core regardless - they all have a solrhome and index dir, and at that scale you probably want to create some form of hierarchy strategy for folders.

          I'm not dead set against some sort of plugin here - but you have to weigh that vs the complexity and cost when users start using it and it makes future refactoring more difficult. I'm just not yet convinced any advantages outweigh that cost yet.

          And I'm not sure just having a use case is enough - do enough people have that use case, are there simpler ways to solve the problem more generally, etc. These are all important factors.

          Show
          Mark Miller added a comment - - edited loading core info is a background thread We would use an executor, like we do now in 5x and 4x dev branches - so n cores would load at a time, but that would not need to affect the enumeration - you could queue up the load info. I have to imagine we can enumerate on a scale similar to storing this data elsewhere. It might be a bit less efficient, but I doubt to the point that it would argue for opening and maintaining a plugin point like this - I doubt it's going to be exponential. I'm also not sure why you would load this many cores at once - I thought the whole point was that a subset would be loaded and the rest on demand/lazily - so mostly you will just be enumerating very quickly. Some OS's do slow down with many thousands of directories in one dir - but you will have the same issue with 1000's of core regardless - they all have a solrhome and index dir, and at that scale you probably want to create some form of hierarchy strategy for folders. I'm not dead set against some sort of plugin here - but you have to weigh that vs the complexity and cost when users start using it and it makes future refactoring more difficult. I'm just not yet convinced any advantages outweigh that cost yet. And I'm not sure just having a use case is enough - do enough people have that use case, are there simpler ways to solve the problem more generally, etc. These are all important factors.
          Hide
          Erick Erickson added a comment -

          bq: I'm also not sure why you would load this many cores at once - I thought the whole point was that a subset would be loaded and the rest on demand/lazily

          I'm not thinking about loading them at all, just getting the information about them (really creating a CoreDescriptor for each I think). Say a request comes in for core Z. I'm assuming that putting all 10K cores in the same place (i.e. 15K core directories under <solr_home> is not acceptable if for no other reason than keeping them organized, as well as the 10K subdirs in a single dir problem. We'd want to allow an arbitrarily deep tree. I was assuming that it would be easiest to generate a map of cores->directories to be able to autoload the cores (and cap the number of currently open cores) given all we have to work with at that point is the core name. So I'm thinking of generating that map at startup. Ditto for getting the list of all available cores etc.

          FWIW, I played with writing 15K config files out and the program to do that is pretty fast w/o any optimizations. Trying read next. Of course that's on my SSD; I'll give it a whirl on my old spinning-disk laptop tonight or tomorrow. And I can imagine throwing a bunch of threads at the issue if necessary (which I'm not yet convinced it is).

          Show
          Erick Erickson added a comment - bq: I'm also not sure why you would load this many cores at once - I thought the whole point was that a subset would be loaded and the rest on demand/lazily I'm not thinking about loading them at all, just getting the information about them (really creating a CoreDescriptor for each I think). Say a request comes in for core Z. I'm assuming that putting all 10K cores in the same place (i.e. 15K core directories under <solr_home> is not acceptable if for no other reason than keeping them organized, as well as the 10K subdirs in a single dir problem. We'd want to allow an arbitrarily deep tree. I was assuming that it would be easiest to generate a map of cores->directories to be able to autoload the cores (and cap the number of currently open cores) given all we have to work with at that point is the core name. So I'm thinking of generating that map at startup. Ditto for getting the list of all available cores etc. FWIW, I played with writing 15K config files out and the program to do that is pretty fast w/o any optimizations. Trying read next. Of course that's on my SSD; I'll give it a whirl on my old spinning-disk laptop tonight or tomorrow. And I can imagine throwing a bunch of threads at the issue if necessary (which I'm not yet convinced it is).
          Hide
          Erick Erickson added a comment -

          OK, Old Macbook Pro (2009 vintage). Fresh boot and 15,000 cores in 150 directories 100 cores in each dir. Reading and parsing all the config files takes 15 seconds or so, worst case scenario. Each core.properties file had the properties listed below (although I doubt having only 1-2 entries/file would make much difference).
          instanceDir
          schema
          config
          name
          dataDir
          shard
          collection
          roles
          properties

          So I decree that this is fast enough and we don't need to do anything fancy here. NOTE: all this really does is create something akin to CoreDescriptors for each of these dirs.

          Robert and Yonik:
          FWIW, thanks for pushing me on this, I'm not at all sure I'd have reached this level of simplicity on my own....

          Show
          Erick Erickson added a comment - OK, Old Macbook Pro (2009 vintage). Fresh boot and 15,000 cores in 150 directories 100 cores in each dir. Reading and parsing all the config files takes 15 seconds or so, worst case scenario. Each core.properties file had the properties listed below (although I doubt having only 1-2 entries/file would make much difference). instanceDir schema config name dataDir shard collection roles properties So I decree that this is fast enough and we don't need to do anything fancy here. NOTE: all this really does is create something akin to CoreDescriptors for each of these dirs. Robert and Yonik: FWIW, thanks for pushing me on this, I'm not at all sure I'd have reached this level of simplicity on my own....
          Hide
          Erick Erickson added a comment -

          I tried a simple threading experiment and it didn't help. In fact it hurt by a few percent on my spinning-disk machine. I guess disk I/O bound is...well...disk I/O bound.

          Show
          Erick Erickson added a comment - I tried a simple threading experiment and it didn't help. In fact it hurt by a few percent on my spinning-disk machine. I guess disk I/O bound is...well...disk I/O bound.
          Hide
          Yonik Seeley added a comment -

          Great to see we're all getting on the same page!

          I think it could still make sense to have some per-node persistent state (just not per-core/shard). Whether we want that to be solr.xml, or start over with something simpler, I don't know. What would go in there? I imagine that any config that could be programmatically changed (as opposed to a required start parameter), and that doesn't make sense to be in ZK (or that the node tells ZK rather than the other way around). The node ID would be one thing (if we didn't auto-construct it from host/port). Currently I don't think we have a place in ZK to store node properties (such as roles, capacity, rack/location, etc). I'm really just brainstorming here... there doesn't seem to be anything that we currently need solr.xml for.

          Something else I've been thinking about is how to deal with the "conf" under each core/shard... if multiple shards in a node use the same config, or multiple collections use the same config, it would be nice not have to repeat that config or reference inside another core's "conf" directory. We solved this in ZK by having a "configs" directory and then referencing the config by name. Perhaps we could do the same thing on the filesystem, but have a "local_configs" directory that's only used when not in ZK mode (of course there are Mark's points about only having ZK mode in the future... but even then, it's nice to have something local to edit to bootstrap ZK with?)

          Show
          Yonik Seeley added a comment - Great to see we're all getting on the same page! I think it could still make sense to have some per-node persistent state (just not per-core/shard). Whether we want that to be solr.xml, or start over with something simpler, I don't know. What would go in there? I imagine that any config that could be programmatically changed (as opposed to a required start parameter), and that doesn't make sense to be in ZK (or that the node tells ZK rather than the other way around). The node ID would be one thing (if we didn't auto-construct it from host/port). Currently I don't think we have a place in ZK to store node properties (such as roles, capacity, rack/location, etc). I'm really just brainstorming here... there doesn't seem to be anything that we currently need solr.xml for. Something else I've been thinking about is how to deal with the "conf" under each core/shard... if multiple shards in a node use the same config, or multiple collections use the same config, it would be nice not have to repeat that config or reference inside another core's "conf" directory. We solved this in ZK by having a "configs" directory and then referencing the config by name. Perhaps we could do the same thing on the filesystem, but have a "local_configs" directory that's only used when not in ZK mode (of course there are Mark's points about only having ZK mode in the future... but even then, it's nice to have something local to edit to bootstrap ZK with?)
          Hide
          Erick Erickson added a comment -

          Yeah, I've been punting with the shared configs (both schema and config). My thought so far is that that would be an implementation detail for the person setting things up. Or at least that approach would do for now. I don't think there is any reason that you couldn't have a local_configs directory, and just have each individual core.properties file specify the correct path. Something like:

          schema=../../../local_configs/schema_1.xml.

          There's already code in place to have various cores share a schema file (don't quite know about solrconfig) that's keyed by the path of the file, so these entries would all be distinct

          schema=../../../local_configs/schema_1.xml
          schema=../../../local_configs1/schema_1.xml
          schema=../../local_configs/schema_1.xml
          schema=../../../local_configs/schema_2.xml

          But I wouldn't want to require this, the model of "just copy the whole core somewhere and fire it up" is nice and simple, you could leave the refinements to people who felt the need. And if the above works, you get both...

          I'm not sure what I think about moving the entries in solr.xml out to, say, system properties, it seems kind of opaque. But I suppose that having them there rather than solr.xml isn't any more difficult, just different. And system props are easier to script I'd guess.

          I've also been thinking that the individual core.properties files should be persistable, probably on a whole-node basis. Otherwise I'm not sure how you'd support things like core renaming.

          Show
          Erick Erickson added a comment - Yeah, I've been punting with the shared configs (both schema and config). My thought so far is that that would be an implementation detail for the person setting things up. Or at least that approach would do for now. I don't think there is any reason that you couldn't have a local_configs directory, and just have each individual core.properties file specify the correct path. Something like: schema=../../../local_configs/schema_1.xml. There's already code in place to have various cores share a schema file (don't quite know about solrconfig) that's keyed by the path of the file, so these entries would all be distinct schema=../../../local_configs/schema_1.xml schema=../../../local_configs1/schema_1.xml schema=../../local_configs/schema_1.xml schema=../../../local_configs/schema_2.xml But I wouldn't want to require this, the model of "just copy the whole core somewhere and fire it up" is nice and simple, you could leave the refinements to people who felt the need. And if the above works, you get both... I'm not sure what I think about moving the entries in solr.xml out to, say, system properties, it seems kind of opaque. But I suppose that having them there rather than solr.xml isn't any more difficult, just different. And system props are easier to script I'd guess. I've also been thinking that the individual core.properties files should be persistable, probably on a whole-node basis. Otherwise I'm not sure how you'd support things like core renaming.
          Hide
          Mark Miller added a comment -

          We solved this in ZK by having a "configs" directory and then referencing the config by name.

          If a non ZooKeeper mode is to stay around, it would be best to replicate the same model IMO. A separate folder of configs that each core can reference by config set name. Easy to share, easy to not share, alignment with SolrCloud setup. Bootstrapping fro a local setup to a cloud setup is easier and more straightforward. Documentation is less of a cluster ***.

          Local mode must go away, or local should live in harmony with SolrCloud. It's not totally the case today because we needed to support the old hardened stuff while SolrCloud was built - but not moving towards harmony or unification is not an option IMO. It has to be done for everyones sake - user and dev. Things like master/slave replication, if still desirable, should become SolrCloud features. Other things perhaps should be dropped.

          Show
          Mark Miller added a comment - We solved this in ZK by having a "configs" directory and then referencing the config by name. If a non ZooKeeper mode is to stay around, it would be best to replicate the same model IMO. A separate folder of configs that each core can reference by config set name. Easy to share, easy to not share, alignment with SolrCloud setup. Bootstrapping fro a local setup to a cloud setup is easier and more straightforward. Documentation is less of a cluster ***. Local mode must go away, or local should live in harmony with SolrCloud. It's not totally the case today because we needed to support the old hardened stuff while SolrCloud was built - but not moving towards harmony or unification is not an option IMO. It has to be done for everyones sake - user and dev. Things like master/slave replication, if still desirable, should become SolrCloud features. Other things perhaps should be dropped.
          Hide
          Erick Erickson added a comment -

          So, for the nonce (4.x) assuming we're supporting both, what's the precedence?

          1> if a configs directory is around, look for the named subdirectory by name and use it. Question: This is an entire "conf" directory, right?

          2> if <1> fails then look for a "conf" directory in the collection dir and look there.

          OR

          <1> do things the way we are now in the absence of a flag of some sort
          <2> if the flag is present (names welcome), do things the new way, i.e. search for named configs in a new directory. What should be the default location? <solr_home>/configs? I suppose the "flag" would be something like -DconfigsDir=blahblahblah....

          My off-the-top-of-my-head preference is for the second way, one way or the other. Trying to allow a hybrid approach seems like a rats-nest to no good purpose and a lot of confusion... "your installation is mis-behaving because you're not using the configs you think you are and the algorithm for finding the right one is....." Yuuuccckkk. Much harder to explain/understand than "if you specify -Dconfigs then all your configs are read from that directory. If not, they're read from your collection directory".

          Show
          Erick Erickson added a comment - So, for the nonce (4.x) assuming we're supporting both, what's the precedence? 1> if a configs directory is around, look for the named subdirectory by name and use it. Question: This is an entire "conf" directory, right? 2> if <1> fails then look for a "conf" directory in the collection dir and look there. OR <1> do things the way we are now in the absence of a flag of some sort <2> if the flag is present (names welcome), do things the new way, i.e. search for named configs in a new directory. What should be the default location? <solr_home>/configs? I suppose the "flag" would be something like -DconfigsDir=blahblahblah.... My off-the-top-of-my-head preference is for the second way, one way or the other. Trying to allow a hybrid approach seems like a rats-nest to no good purpose and a lot of confusion... "your installation is mis-behaving because you're not using the configs you think you are and the algorithm for finding the right one is....." Yuuuccckkk. Much harder to explain/understand than "if you specify -Dconfigs then all your configs are read from that directory. If not, they're read from your collection directory".
          Hide
          Yonik Seeley added a comment -

          Just to be clear Erick, you don't need to solve named configs in this issue (but it is good to think ahead while implementing any feature).

          Show
          Yonik Seeley added a comment - Just to be clear Erick, you don't need to solve named configs in this issue (but it is good to think ahead while implementing any feature).
          Hide
          Mark Miller added a comment -

          Some initial thoughts:

          What if we handled meta data with a simple properties file at the root of each core directory (solrhome for that core). Fast to parse, if it has a good name it's easy to identify cores, etc.
          One of the properties could be a config set name? Only if that file exists and specifies a conf set name would you use the new logic.

          And if you found a solr.xml for back compat, you wouldn't even start looking for core dirs - but if solr.xml is not there start looking for xxxx.properties in each folder found in solrhome?

          Once you have read xxxx.properties, you either find a conf set in it, or probably kick into the rules solrcloud uses - if there is only one conf set, use that, if there is a conf set with the same name as the core, use that, etc.

          For way back bat compat where we didn't have a solr.xml and the example was single core, I'm not yet sure what a good idea is. It could be something like, if we find no xxxx.properties in any sub dirs and solrhome has a conf dir sub directory, use the built in back compat, single core solr.xml that we already have in CoreContainer.

          Then in 5x require the new style.

          Show
          Mark Miller added a comment - Some initial thoughts: What if we handled meta data with a simple properties file at the root of each core directory (solrhome for that core). Fast to parse, if it has a good name it's easy to identify cores, etc. One of the properties could be a config set name? Only if that file exists and specifies a conf set name would you use the new logic. And if you found a solr.xml for back compat, you wouldn't even start looking for core dirs - but if solr.xml is not there start looking for xxxx.properties in each folder found in solrhome? Once you have read xxxx.properties, you either find a conf set in it, or probably kick into the rules solrcloud uses - if there is only one conf set, use that, if there is a conf set with the same name as the core, use that, etc. For way back bat compat where we didn't have a solr.xml and the example was single core, I'm not yet sure what a good idea is. It could be something like, if we find no xxxx.properties in any sub dirs and solrhome has a conf dir sub directory, use the built in back compat, single core solr.xml that we already have in CoreContainer. Then in 5x require the new style.
          Hide
          Erick Erickson added a comment -

          Looking for some advice here. The problem is that parsing the solr.xml file in CoreContainer.java is so intimately tied in to the fact that it's xml through the use of the Config class. It starts out simply in initialize, but then tendrils of XML-ese (XPath) is scattered all over the file.

          So, I've thought of a few options.
          1> I could be really slimy and, in initialize instead of using the canned solr.xml (DEF_SOLR_XML), I could construct a new string that looked just like a really big solr.xml file from all the cores that are discovered and pass that in just as things are done now. But I think I'd have to wash my hands afterwards. That doesn't get us any forwarder in terms of obsoleting solr.xml. It'd be fast though with minimal code disruption. But it'd build an XML file just to parse it into the DOM. Wasteful.

          2> Abstract all of the current XPath/XML stuff specific in CoreContainer into a thunking layer that understood both ways of looking at the world. If it was initialized from a current solr.xml, it'd just pass all the stuff right through to the current Config. If it was populated by discovery, resolve the request "natively". So, for instance, a call like

          cfg.getInt("solr/cores/@swappableCacheSize", Integer.MAX_VALUE)
          in CoreContainer would be replaced by
          newcfg.getCoresInt("swappableCacheSize", Integer.MAX_VALUE)

          Under the covers, this would resolve to something like
          if (initialized from discovery) return newcfg.coresprops.getInt("swappableCacheSize", Integer.MAX_VALUE);
          else return oldcfg.getInt("solr/cores/@swappableCacheSize", Integer.MAX_VALUE);

          There'd be a newcfg.solr.getget### for things defined in the <solr ....> tag etc.

          3> Take my current notion of a pluggable CoreDescriptorProvider and just make it not pluggable. Populate it up front with the discovery process and go from there. This seems more consistent with the ZK CoreDescriptorProvider that's already there.

          4> ?????

          I think the whole question of whether config files live in a central directory and the rest of this discussion is orthogonal to this issue. There's really two questions here I guess.
          a> is the trouble/complexity of a thunking layer worth the effort? It'll go a ways towards separating out the requirement of XML parsing for solr.xml at a cost of that complexity. Not sure it's a good call.
          b> Is there another approach that I'm overlooking?

          Of the three, I'm torn between the <2> and <3>, I can argue either way. <1> seems really hacky.

          Show
          Erick Erickson added a comment - Looking for some advice here. The problem is that parsing the solr.xml file in CoreContainer.java is so intimately tied in to the fact that it's xml through the use of the Config class. It starts out simply in initialize, but then tendrils of XML-ese (XPath) is scattered all over the file. So, I've thought of a few options. 1> I could be really slimy and, in initialize instead of using the canned solr.xml (DEF_SOLR_XML), I could construct a new string that looked just like a really big solr.xml file from all the cores that are discovered and pass that in just as things are done now. But I think I'd have to wash my hands afterwards. That doesn't get us any forwarder in terms of obsoleting solr.xml. It'd be fast though with minimal code disruption. But it'd build an XML file just to parse it into the DOM. Wasteful. 2> Abstract all of the current XPath/XML stuff specific in CoreContainer into a thunking layer that understood both ways of looking at the world. If it was initialized from a current solr.xml, it'd just pass all the stuff right through to the current Config. If it was populated by discovery, resolve the request "natively". So, for instance, a call like cfg.getInt("solr/cores/@swappableCacheSize", Integer.MAX_VALUE) in CoreContainer would be replaced by newcfg.getCoresInt("swappableCacheSize", Integer.MAX_VALUE) Under the covers, this would resolve to something like if (initialized from discovery) return newcfg.coresprops.getInt("swappableCacheSize", Integer.MAX_VALUE); else return oldcfg.getInt("solr/cores/@swappableCacheSize", Integer.MAX_VALUE); There'd be a newcfg.solr.getget### for things defined in the <solr ....> tag etc. 3> Take my current notion of a pluggable CoreDescriptorProvider and just make it not pluggable. Populate it up front with the discovery process and go from there. This seems more consistent with the ZK CoreDescriptorProvider that's already there. 4> ????? I think the whole question of whether config files live in a central directory and the rest of this discussion is orthogonal to this issue. There's really two questions here I guess. a> is the trouble/complexity of a thunking layer worth the effort? It'll go a ways towards separating out the requirement of XML parsing for solr.xml at a cost of that complexity. Not sure it's a good call. b> Is there another approach that I'm overlooking? Of the three, I'm torn between the <2> and <3>, I can argue either way. <1> seems really hacky.
          Hide
          Erick Erickson added a comment - - edited

          When the work on 4083 is done, this should be resolved. This is just a marker to clean up.

          4083 == infinite recursion, meant SOLR-4196

          Show
          Erick Erickson added a comment - - edited When the work on 4083 is done, this should be resolved. This is just a marker to clean up. 4083 == infinite recursion, meant SOLR-4196
          Hide
          Mark Miller added a comment -

          I'm going to try and take a closer look at this today.

          Show
          Mark Miller added a comment - I'm going to try and take a closer look at this today.
          Hide
          Shawn Heisey added a comment -

          Discussion on SOLR-4450 has led to an idea that I think is important: Separating zookeeper information (zkHost, zkClientTimeout, hostPort, hostContext, etc) from everything else, particularly information about defined cores.

          Currently zookeeper and core info are both found in solr.xml, so you can't just copy the config file unmodified from one SolrCloud server to another to obtain or update zookeeper info. With attention on getting rid of solr.xml, I think these bits of information should end up in separate files.

          Show
          Shawn Heisey added a comment - Discussion on SOLR-4450 has led to an idea that I think is important: Separating zookeeper information (zkHost, zkClientTimeout, hostPort, hostContext, etc) from everything else, particularly information about defined cores. Currently zookeeper and core info are both found in solr.xml, so you can't just copy the config file unmodified from one SolrCloud server to another to obtain or update zookeeper info. With attention on getting rid of solr.xml, I think these bits of information should end up in separate files.
          Hide
          Erick Erickson added a comment -

          Shawn:

          I haven't thought this through, but could you raise a JIRA to this effect? Currently, the ZK information is in solr.properties which replaces solr.xml as far as defining individual cores is concerned. However, it still allows you to define all the stuff (including ZK info) that used to be in the <cores...> tag....

          Go ahead and assign it to me if you want. Mostly I'm making sure you still think this is an issue, I suspect it is...

          Thanks,
          Erick

          Show
          Erick Erickson added a comment - Shawn: I haven't thought this through, but could you raise a JIRA to this effect? Currently, the ZK information is in solr.properties which replaces solr.xml as far as defining individual cores is concerned. However, it still allows you to define all the stuff (including ZK info) that used to be in the <cores...> tag.... Go ahead and assign it to me if you want. Mostly I'm making sure you still think this is an issue, I suspect it is... Thanks, Erick

            People

            • Assignee:
              Erick Erickson
              Reporter:
              Erick Erickson
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development