Solr
  1. Solr
  2. SOLR-4196

Untangle XML-specific nature of Config and Container classes

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.3, Trunk
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      sub-task for SOLR-4083. If we're going to try to obsolete solr.xml, we need to pull all of the specific XML processing out of Config and Container. Currently, we refer to xpaths all over the place. This JIRA is about providing a thunking layer to isolate the XML-esque nature of solr.xml and allow a simple properties file to be used instead which will lead, eventually, to solr.xml going away.

      1. StressTest.zip
        8 kB
        Erick Erickson
      2. StressTest.zip
        8 kB
        Erick Erickson
      3. StressTest.zip
        9 kB
        Erick Erickson
      4. StressTest.zip
        10 kB
        Erick Erickson
      5. SOLR-4196.patch
        53 kB
        Erick Erickson
      6. SOLR-4196.patch
        60 kB
        Erick Erickson
      7. SOLR-4196.patch
        61 kB
        Erick Erickson
      8. SOLR-4196.patch
        88 kB
        Erick Erickson
      9. SOLR-4196.patch
        111 kB
        Erick Erickson
      10. SOLR-4196.patch
        131 kB
        Erick Erickson
      11. SOLR-4196.patch
        137 kB
        Erick Erickson
      12. SOLR-4196.patch
        138 kB
        Erick Erickson
      13. SOLR-4196.patch
        166 kB
        Erick Erickson
      14. SOLR-4196.patch
        173 kB
        Erick Erickson
      15. SOLR-4196.patch
        176 kB
        Erick Erickson
      16. SOLR-4196.patch
        178 kB
        Erick Erickson
      17. SOLR-4196.patch
        185 kB
        Erick Erickson
      18. SOLR-4196.patch
        185 kB
        Erick Erickson

        Issue Links

          Activity

          Hide
          Erick Erickson added a comment -

          Started to do this as part of SOLR-4083, but that was a mess.

          Show
          Erick Erickson added a comment - Started to do this as part of SOLR-4083 , but that was a mess.
          Hide
          Jack Krupansky added a comment -

          I notice this has fix version of 4.1... there are some people who are thinking that a 4.1 release could occur in the fairly near future - is the work on this issue likely to be finished and tested and stabilized very quickly, like within the next few weeks? It sounds more like a 4.x > 4.1 if not 5.0. In fact, if the goal is to eliminate solr.xml, is that maybe too disruptive a change for any 4.x? OTOH, if this specific issue is simply "cleanup" in preparation for a separate issue to actually eliminate solr.xml, that would be different. Actually, I do now see that this is only a sub-task of SOLR-4083 - which has no fix version at all.

          Show
          Jack Krupansky added a comment - I notice this has fix version of 4.1... there are some people who are thinking that a 4.1 release could occur in the fairly near future - is the work on this issue likely to be finished and tested and stabilized very quickly, like within the next few weeks? It sounds more like a 4.x > 4.1 if not 5.0. In fact, if the goal is to eliminate solr.xml, is that maybe too disruptive a change for any 4.x? OTOH, if this specific issue is simply "cleanup" in preparation for a separate issue to actually eliminate solr.xml, that would be different. Actually, I do now see that this is only a sub-task of SOLR-4083 - which has no fix version at all.
          Hide
          Erick Erickson added a comment -

          Don't worry about it, the other one probaby won't be in 4.1 either. They'll all get bumped to 4.2 or whatever.

          Show
          Erick Erickson added a comment - Don't worry about it, the other one probaby won't be in 4.1 either. They'll all get bumped to 4.2 or whatever.
          Hide
          Erick Erickson added a comment -

          I'm starting by trying to pull all XML references out of CoreContainer. Frankly it's turning nasty, assumptions about parsing XML are scattered all over the place.

          Does anyone have some grand scheme in mind for handling this? I've got an approach, but it's tedious, mostly making a new ConfigSolr class that provides a thunking layer. Removing all the references to XML, DOM, wc3 sure makes a lot of red stuff in IntelliJ.....

          All I'm looking for here is if I'm overlooking the obvious. Don't want to get all through with it and discover there was a simpler way someone had already scoped out. I'm not about to make a complete copy of CoreContainer and have to maintain both.....

          Show
          Erick Erickson added a comment - I'm starting by trying to pull all XML references out of CoreContainer. Frankly it's turning nasty, assumptions about parsing XML are scattered all over the place. Does anyone have some grand scheme in mind for handling this? I've got an approach, but it's tedious, mostly making a new ConfigSolr class that provides a thunking layer. Removing all the references to XML, DOM, wc3 sure makes a lot of red stuff in IntelliJ..... All I'm looking for here is if I'm overlooking the obvious. Don't want to get all through with it and discover there was a simpler way someone had already scoped out. I'm not about to make a complete copy of CoreContainer and have to maintain both.....
          Hide
          Ryan McKinley added a comment -

          Does anyone have some grand scheme in mind for handling this?

          I tried a few years back.... as you you see it is pretty hairy!

          I think the right approach is to have java objects that represent the configs. Then have a different class that can read (write?) the configs to XML (or json, etc)

          Show
          Ryan McKinley added a comment - Does anyone have some grand scheme in mind for handling this? I tried a few years back.... as you you see it is pretty hairy! I think the right approach is to have java objects that represent the configs. Then have a different class that can read (write?) the configs to XML (or json, etc)
          Hide
          Mark Miller added a comment -

          Yeah, eventually we want to get rid of all that. Doing it all at once seems difficult though.

          Perhaps you can build an in memory xml dom from the directory layout and pass it around. This makes back compat support with the current solr.xml fairly easy.

          On the other hand, solr.xml does not have that much to it...perhaps I can try and lend a hand sometime soon - I can at least do a little more investigation to see.

          Show
          Mark Miller added a comment - Yeah, eventually we want to get rid of all that. Doing it all at once seems difficult though. Perhaps you can build an in memory xml dom from the directory layout and pass it around. This makes back compat support with the current solr.xml fairly easy. On the other hand, solr.xml does not have that much to it...perhaps I can try and lend a hand sometime soon - I can at least do a little more investigation to see.
          Hide
          Erick Erickson added a comment -

          @Ryan
          I have a ConfigSolr object that I'm using that understands whether the source is xml or properties. I'm trying to keep xml and DOM out of that and moving the bits specific to XML to Config. It'll handle properties, and (hopefully) make it easy to nuke solr.xml by taking all the "if (fromxml) call the Config methods, else do it myself. Sounds close enough to what you're thinking?

          BTW, I'm going to name ConfigSolr something else, suggestions? There's a SolrConfig object already. I still remember a programming class in college where I had BlortBlivet and BlivetBlort classes and getting them mixed up....

          @Mark
          The hard part isn't the complexity of solr.xml, it's all the places where XML is assumed. It's just tedious and error prone. Sure hope the tests work really, really well <G>.....

          For instance, ZkController.bootstrap gets a Config object and then quite reasonably assumes it's got access to the DOM. So that'll have to change some too. Not a lot, just a single method...

          I hope by Sunday I'll have some compilable code ready to look at. If I'm lucky, I'll have it working with the current (xml) schema and the bits for a .properties file stubbed out. At that point I'll have something others can look at and put up a nocommit patch. I'm almost, but not quite totally, sure that I'm going to screw something up here, I'm moving a lot of code around.

          On which note, merging other changes to CoreContainer is going to be...interesting...

          As you can tell I can't spend a lot of time on it each day, other commitments....

          Show
          Erick Erickson added a comment - @Ryan I have a ConfigSolr object that I'm using that understands whether the source is xml or properties. I'm trying to keep xml and DOM out of that and moving the bits specific to XML to Config. It'll handle properties, and (hopefully) make it easy to nuke solr.xml by taking all the "if (fromxml) call the Config methods, else do it myself. Sounds close enough to what you're thinking? BTW, I'm going to name ConfigSolr something else, suggestions? There's a SolrConfig object already. I still remember a programming class in college where I had BlortBlivet and BlivetBlort classes and getting them mixed up.... @Mark The hard part isn't the complexity of solr.xml, it's all the places where XML is assumed. It's just tedious and error prone. Sure hope the tests work really, really well <G>..... For instance, ZkController.bootstrap gets a Config object and then quite reasonably assumes it's got access to the DOM. So that'll have to change some too. Not a lot, just a single method... I hope by Sunday I'll have some compilable code ready to look at. If I'm lucky, I'll have it working with the current (xml) schema and the bits for a .properties file stubbed out. At that point I'll have something others can look at and put up a nocommit patch. I'm almost, but not quite totally, sure that I'm going to screw something up here, I'm moving a lot of code around. On which note, merging other changes to CoreContainer is going to be...interesting... As you can tell I can't spend a lot of time on it each day, other commitments....
          Hide
          Mark Miller added a comment -

          Yeah, but since the xml is not very complex, most of it just seemed like grunt work, and perhaps not too much of it? I have not looked too closely in a while, but I would have approached it by just making a 'solr.xml' java object that I stuck the settings on and then pass it around same as the xml dom was passed around. Its just replacing one value holder for another right?

          bootstrapConf for example, just wants to get a couple of the properties off the cores. This is probably going to be a common thing - getting both corecontainer and core level properties...that is what solr.xml is all about.

          So like in the persist code I would make a class to represent a cores properties and then I'd make a class to represent the corecontainers properties. Then I'd compose these classes based on teh dir layout and configs and sys properties and end up with a core container object with n properties and y core objects with x properties. boostrapConf would get to easily pull the core properties it wants, and so would every other place that ends up needing to look at a cores properties. CoreContainer level properties would also be available to pass around.

          Show
          Mark Miller added a comment - Yeah, but since the xml is not very complex, most of it just seemed like grunt work, and perhaps not too much of it? I have not looked too closely in a while, but I would have approached it by just making a 'solr.xml' java object that I stuck the settings on and then pass it around same as the xml dom was passed around. Its just replacing one value holder for another right? bootstrapConf for example, just wants to get a couple of the properties off the cores. This is probably going to be a common thing - getting both corecontainer and core level properties...that is what solr.xml is all about. So like in the persist code I would make a class to represent a cores properties and then I'd make a class to represent the corecontainers properties. Then I'd compose these classes based on teh dir layout and configs and sys properties and end up with a core container object with n properties and y core objects with x properties. boostrapConf would get to easily pull the core properties it wants, and so would every other place that ends up needing to look at a cores properties. CoreContainer level properties would also be available to pass around.
          Hide
          Erick Erickson added a comment -

          Mark:

          I think this approach is close enough to what I'm doing for a first approximation. I should have some code for all to examine (with the non-XML-based properties stuff stubbed out) pretty soon and we can see if we're all aligned or not.

          Show
          Erick Erickson added a comment - Mark: I think this approach is close enough to what I'm doing for a first approximation. I should have some code for all to examine (with the non-XML-based properties stuff stubbed out) pretty soon and we can see if we're all aligned or not.
          Hide
          Erick Erickson added a comment -

          OK, here's a patch. It has lots of stuff left to do, I'm just putting it up for your delectation. And anyone who wants to jump in and help, please feel free....

          But it's a lot of code changes, I'm amazed that it works as well as it does.

          There are lots of TODOs. Nobody should even think about committing this. What it does is separate out all of the XML-specific code from CoreContainer (and a couple of other classes) and provide a thunking class. The thunking class doesn't yet implement doing anything with a properties file, it just sends calls to the old Config class.

          It also moves a bunch of stuff to the old Config class that may not really belong there since they're specific to solr.xml not general xml config classes. It seems like it'd be cleaner to create a new class to hold the stuff specific to solr.xml, then we'd have a SolrProperties and SolrOldXml class or some such. But I'll leave that for a bit later, want to get this stuff up into the JIRA for people to take a look at for the general approach, we can deal with ugly details before it's finally committed.

          I'm fairly pleased that only a few tests are failing:
          BasicDistributedZkTest
          ZkCLITest

          I'll probably be able to take a look at these later this holiday, but I just managed to fix up 208 other failing cases (alright, it only took one code change, but still <G>).

          Show
          Erick Erickson added a comment - OK, here's a patch. It has lots of stuff left to do, I'm just putting it up for your delectation. And anyone who wants to jump in and help, please feel free.... But it's a lot of code changes, I'm amazed that it works as well as it does. There are lots of TODOs. Nobody should even think about committing this. What it does is separate out all of the XML-specific code from CoreContainer (and a couple of other classes) and provide a thunking class. The thunking class doesn't yet implement doing anything with a properties file, it just sends calls to the old Config class. It also moves a bunch of stuff to the old Config class that may not really belong there since they're specific to solr.xml not general xml config classes. It seems like it'd be cleaner to create a new class to hold the stuff specific to solr.xml, then we'd have a SolrProperties and SolrOldXml class or some such. But I'll leave that for a bit later, want to get this stuff up into the JIRA for people to take a look at for the general approach, we can deal with ugly details before it's finally committed. I'm fairly pleased that only a few tests are failing: BasicDistributedZkTest ZkCLITest I'll probably be able to take a look at these later this holiday, but I just managed to fix up 208 other failing cases (alright, it only took one code change, but still <G>).
          Hide
          Erick Erickson added a comment -

          fixes one more test, and breaks out classes.

          Show
          Erick Erickson added a comment - fixes one more test, and breaks out classes.
          Hide
          Erick Erickson added a comment -

          All tests pass, also found another intrusion of DOM in CoreContainer.xml and removed it.

          CoreContainer.xml should now be entirely free of anything referencing any XML.

          Show
          Erick Erickson added a comment - All tests pass, also found another intrusion of DOM in CoreContainer.xml and removed it. CoreContainer.xml should now be entirely free of anything referencing any XML.
          Hide
          Erick Erickson added a comment -

          Latest version. This has enough of the methods for the properties case to allow at least a very simple test with using a properties file rather than a solr.xml file to work. Nowhere near complete testing of using the properties file, a start though.

          I did violence to CoreDescriptor, having individual setters and getters when we move to properties-based files seems unnecessary, so I combined the individual fields into a Properties member var. I could be talked out of that, but it seems like it's going in the right direction.

          I think at this point we can see what this whole clumsy intermediate form for back-compat will look like, gives me more sympathy for folks maintaining the back-compat layers in Lucene <G>...

          I was wondering if there's a way to wire in randomly using the properties case or the xml-based case to leverage all of the tests built up over the years and all the assumptions they make... haven't looked at how to do that at all though....

          Show
          Erick Erickson added a comment - Latest version. This has enough of the methods for the properties case to allow at least a very simple test with using a properties file rather than a solr.xml file to work. Nowhere near complete testing of using the properties file, a start though. I did violence to CoreDescriptor, having individual setters and getters when we move to properties-based files seems unnecessary, so I combined the individual fields into a Properties member var. I could be talked out of that, but it seems like it's going in the right direction. I think at this point we can see what this whole clumsy intermediate form for back-compat will look like, gives me more sympathy for folks maintaining the back-compat layers in Lucene <G>... I was wondering if there's a way to wire in randomly using the properties case or the xml-based case to leverage all of the tests built up over the years and all the assumptions they make... haven't looked at how to do that at all though....
          Hide
          Erick Erickson added a comment -

          Latest version, getting closer to committing. This version adds some basic tests for using properties for each of the cores, and discovers cores on disk.

          I'm going to try to add tests for persisting and swapping cores next, I think that's probably the place I've punted most. We'll need to tests to create a core, modify it, swap them etc. next. At this point core manipulation is the weakest part.

          Show
          Erick Erickson added a comment - Latest version, getting closer to committing. This version adds some basic tests for using properties for each of the cores, and discovers cores on disk. I'm going to try to add tests for persisting and swapping cores next, I think that's probably the place I've punted most. We'll need to tests to create a core, modify it, swap them etc. next. At this point core manipulation is the weakest part.
          Hide
          Erick Erickson added a comment -

          Latest rev, includes persistence for solr.properties and individual core files.

          Need to add persitence for transient cores that are not in memory when persist is called, that'll be coming soon.

          Show
          Erick Erickson added a comment - Latest rev, includes persistence for solr.properties and individual core files. Need to add persitence for transient cores that are not in memory when persist is called, that'll be coming soon.
          Hide
          Andy Fowler added a comment -

          I noticed a bug in Solr-4.1 release that if there are unloaded transient cores in solr.xml, and a new core is created via the admin handler, the record of the core in solr.xml is removed on persist. From some poking around at a few different issues, it seems that you're aware of this? Really excited about your work on LRU core management, Erick!

          Show
          Andy Fowler added a comment - I noticed a bug in Solr-4.1 release that if there are unloaded transient cores in solr.xml, and a new core is created via the admin handler, the record of the core in solr.xml is removed on persist. From some poking around at a few different issues, it seems that you're aware of this? Really excited about your work on LRU core management, Erick!
          Hide
          Erick Erickson added a comment -

          Andy:

          This patch is getting big so I'm starting to defer "stuff that comes up" to new JIRA's just for bookeeping purposes. Could I ask you to put in a new JIRA for this and assign it to me (if you can). If you can't assign it to me, I get notifications when new JIRAs come by and I'll grab it....

          Also, if you're willing, it'd be great if you could exercise this stuff when I merge it back into 4.x, nothing like having someone hit my code to flush out assumptions and such.

          Thanks
          Erick

          Show
          Erick Erickson added a comment - Andy: This patch is getting big so I'm starting to defer "stuff that comes up" to new JIRA's just for bookeeping purposes. Could I ask you to put in a new JIRA for this and assign it to me (if you can). If you can't assign it to me, I get notifications when new JIRAs come by and I'll grab it.... Also, if you're willing, it'd be great if you could exercise this stuff when I merge it back into 4.x, nothing like having someone hit my code to flush out assumptions and such. Thanks Erick
          Hide
          Erick Erickson added a comment -

          Latest iteration, got rid of the thunking class, it helped me sort things out but I didn't want to leave it, it's now an interface implemented by the old and new style classes. It should fade away when we obsolete solr.xml.

          Still having one problem with the tests. TestSolrProperties.testPersistTrue runs in IntelliJ, but fails from ant, apparently the file paths aren't quite done the same inside and out. The reported error is that I'm not closing my cores enough....

          Show
          Erick Erickson added a comment - Latest iteration, got rid of the thunking class, it helped me sort things out but I didn't want to leave it, it's now an interface implemented by the old and new style classes. It should fade away when we obsolete solr.xml. Still having one problem with the tests. TestSolrProperties.testPersistTrue runs in IntelliJ, but fails from ant, apparently the file paths aren't quite done the same inside and out. The reported error is that I'm not closing my cores enough....
          Hide
          Erick Erickson added a comment -

          Fixed the test that failed, silly error in writing the test.

          This is perilously close to being ready to commit, I'd appreciate it if anyone who has a chance can take a look. There are a lot of changes, so I expect to check it in to trunk and let it bake for a bit before merging to 4.x.

          What worries me most is that I had to change how solr.xml is handled. If the new functionality is broken that's one thing, but if old functionality is broken that's more serious. Any comments about how I handled solr.xml (the current way we do things) are welcome. I suppose if people are very kind they could apply the patch (trunk, haven't tried to apply it to 4.x yet) and beast it.

          The test cases work, what I want to do now is spin up a whole bunch of cores on my machine and just try it just running Solr rather than from the test harness, merrily swapping cores in and out, updating to non-present cores, then querying them etc. Shouldn't be too difficult, and if that works I'll be committing to trunk first.

          There are still several JIRAs that need to be completed before this is "feature complete", but I think by far most of the work is contained in this JIRA and SOLR-1028 (this latter is already committed).

          Show
          Erick Erickson added a comment - Fixed the test that failed, silly error in writing the test. This is perilously close to being ready to commit, I'd appreciate it if anyone who has a chance can take a look. There are a lot of changes, so I expect to check it in to trunk and let it bake for a bit before merging to 4.x. What worries me most is that I had to change how solr.xml is handled. If the new functionality is broken that's one thing, but if old functionality is broken that's more serious. Any comments about how I handled solr.xml (the current way we do things) are welcome. I suppose if people are very kind they could apply the patch (trunk, haven't tried to apply it to 4.x yet) and beast it. The test cases work, what I want to do now is spin up a whole bunch of cores on my machine and just try it just running Solr rather than from the test harness, merrily swapping cores in and out, updating to non-present cores, then querying them etc. Shouldn't be too difficult, and if that works I'll be committing to trunk first. There are still several JIRAs that need to be completed before this is "feature complete", but I think by far most of the work is contained in this JIRA and SOLR-1028 (this latter is already committed).
          Hide
          Mark Miller added a comment -

          I'm going to take a closer look at the latest patch.

          Have you added tests for the new core loading style? I think we want to heavily exercise that in tests before committing.

          Started to glance over the patch and saw some stylistic stuff - eg class members starting with underscore - we should make that consistent with the majority of the code base.

          I'm also seeing some javadoc warnings to fix.

          From a high level, it all looks pretty good to me. I'll try and take some time to dig in over the next couple days though.

          Show
          Mark Miller added a comment - I'm going to take a closer look at the latest patch. Have you added tests for the new core loading style? I think we want to heavily exercise that in tests before committing. Started to glance over the patch and saw some stylistic stuff - eg class members starting with underscore - we should make that consistent with the majority of the code base. I'm also seeing some javadoc warnings to fix. From a high level, it all looks pretty good to me. I'll try and take some time to dig in over the next couple days though.
          Hide
          Erick Erickson added a comment -

          Mark:

          But I like underscores. Sigggh, I'll take them out. Haven't done the precommit thing yet since I'm not ready quite yet so I'm sure there are a bunch of things like javadocs.

          Funny you should talk about stressing it, I spent most of the day working on exactly that, see below. I'm planning to let the stress test run all night tonight...

          Here's the latest version.

          I've also attached a test program that creates however many cores you want, then fires off indexing and query threads that use random cores. See the discussion in the StressTest.java file. It does do some recursive file deletes to clean the cores it created last time, but unless you have your directories names with SOLR4196 in them, you should be safe enough.

          There's a whole lotta synchronization going on, I'm not especially comfortable with it all, but the fundamental difference is that the cores come and go far more rapidly than they used to. In fact I spent part of today chasing down a deadlock that would regularly appear after 20 minutes or so. The stress test ran for an hour no problem, but I'm going to run it all night tonight. But the cause I found really did look like it was a strong possibility, but whether there are other gremlins out there I'm not sure.

          There's one failing test at this point, I'm too bushed to chase it down tonight.

          Show
          Erick Erickson added a comment - Mark: But I like underscores. Sigggh, I'll take them out. Haven't done the precommit thing yet since I'm not ready quite yet so I'm sure there are a bunch of things like javadocs. Funny you should talk about stressing it, I spent most of the day working on exactly that, see below. I'm planning to let the stress test run all night tonight... Here's the latest version. I've also attached a test program that creates however many cores you want, then fires off indexing and query threads that use random cores. See the discussion in the StressTest.java file. It does do some recursive file deletes to clean the cores it created last time, but unless you have your directories names with SOLR4196 in them, you should be safe enough. There's a whole lotta synchronization going on, I'm not especially comfortable with it all, but the fundamental difference is that the cores come and go far more rapidly than they used to. In fact I spent part of today chasing down a deadlock that would regularly appear after 20 minutes or so. The stress test ran for an hour no problem, but I'm going to run it all night tonight. But the cause I found really did look like it was a strong possibility, but whether there are other gremlins out there I'm not sure. There's one failing test at this point, I'm too bushed to chase it down tonight.
          Hide
          Erick Erickson added a comment -

          Cleaned up the precommit warnings and other stylistic issues as per Mark,s comments. Fixed an unpleasant test case failure I was having with persisting solr.xml, which caused me to refactor some of that code.

          Mark: Yes, there are tests for the new core loading style, see TestCoreProperties. Which unfortunately has the same name as another file in another package, I may change it...

          But besides that, this patch correlates with a note I just sent to the dev list about a lockup I'm seeing. My stress tester hits it quite regularly. I've attached an updated version of the stress test code as well.

          Any help highly welcome! Synchronization is always a pain....

          Show
          Erick Erickson added a comment - Cleaned up the precommit warnings and other stylistic issues as per Mark,s comments. Fixed an unpleasant test case failure I was having with persisting solr.xml, which caused me to refactor some of that code. Mark: Yes, there are tests for the new core loading style, see TestCoreProperties. Which unfortunately has the same name as another file in another package, I may change it... But besides that, this patch correlates with a note I just sent to the dev list about a lockup I'm seeing. My stress tester hits it quite regularly. I've attached an updated version of the stress test code as well. Any help highly welcome! Synchronization is always a pain....
          Hide
          Erick Erickson added a comment -

          Added the ability to create the old-style solr.xml program rather than the new-style properties file for testing against current trunk. See -x param.

          Show
          Erick Erickson added a comment - Added the ability to create the old-style solr.xml program rather than the new-style properties file for testing against current trunk. See -x param.
          Hide
          Mark Miller added a comment -

          Looking pretty good!

          Another clean up comment - perhaps the following:

          solr.persistent=true
          solr.cores.defaultCoreName=00000_SOLR4196
          solr.cores.transientCacheSize=20
          solr.cores.adminPath=/admin/cores

          should have solr. taken out. It's already scoped to solr really.

          Show
          Mark Miller added a comment - Looking pretty good! Another clean up comment - perhaps the following: solr.persistent=true solr.cores.defaultCoreName=00000_SOLR4196 solr.cores.transientCacheSize=20 solr.cores.adminPath=/admin/cores should have solr. taken out. It's already scoped to solr really.
          Hide
          Mark Miller added a comment -

          Also, fyi, for some reason eclipse doesn't like your latest patch - keeps claiming it's not valid...

          Show
          Mark Miller added a comment - Also, fyi, for some reason eclipse doesn't like your latest patch - keeps claiming it's not valid...
          Hide
          Mark Miller added a comment -

          The first deadlock I've seen is with -x and involves many core gets and registers. Looks like they are all waiting on the cores lock.

          Show
          Mark Miller added a comment - The first deadlock I've seen is with -x and involves many core gets and registers. Looks like they are all waiting on the cores lock.
          Hide
          Erick Erickson added a comment -

          Hmmm, I have no idea why Eclipse didn't like the patch, I just generated it with svn diff. But it I'm guessing you have it applied anyway.

          Sounds like we should raise a new JIRA about the deadlock issue, so I raised SOLR-4400 since it doesn't look like it's related to this particular patch... Certainly SOLR-1028 made the stress test possible so we're seeing it.

          NOTE: I'm pretty sure this applies to 4x too, I'm running that test now, will record it in SOLR-4400

          Show
          Erick Erickson added a comment - Hmmm, I have no idea why Eclipse didn't like the patch, I just generated it with svn diff. But it I'm guessing you have it applied anyway. Sounds like we should raise a new JIRA about the deadlock issue, so I raised SOLR-4400 since it doesn't look like it's related to this particular patch... Certainly SOLR-1028 made the stress test possible so we're seeing it. NOTE: I'm pretty sure this applies to 4x too, I'm running that test now, will record it in SOLR-4400
          Hide
          Mark Miller added a comment -

          Hmmm, I have no idea why Eclipse didn't like the patch,

          I mention it because previous patches applied no problem, so I wasn't sure if something was done differently. I almost never find a patch eclipse won't apply for me.

          Show
          Mark Miller added a comment - Hmmm, I have no idea why Eclipse didn't like the patch, I mention it because previous patches applied no problem, so I wasn't sure if something was done differently. I almost never find a patch eclipse won't apply for me.
          Hide
          Erick Erickson added a comment -

          I'm really starting to hate this patch. All junit tests except one run. And the lucky winner is.....
          BasicDistributedZkTest.testDistribSearch. It fails on my mac first time, every time. Which may be a good thing if it means we can get to the bottom of this one. Yeah, I'm reaching...

          repeated message:
          HEARTBEAT J0 PID(11453@Ericks-MacBook-Pro.local): 2013-02-04T20:25:21, stalled for ###s at: BasicDistributedZkTest.testDistribSearch

          I really won't have any more time to devote to this patch this week, so if anybody feels ambitious please feel free. I'd be happy to provide stack traces, logs, or test things, but I've got a bunch of other stuff stacked up.

          SolrCloud comes up fine when following the Wiki FWIW.

          Show
          Erick Erickson added a comment - I'm really starting to hate this patch. All junit tests except one run. And the lucky winner is..... BasicDistributedZkTest.testDistribSearch. It fails on my mac first time, every time. Which may be a good thing if it means we can get to the bottom of this one. Yeah, I'm reaching... repeated message: HEARTBEAT J0 PID(11453@Ericks-MacBook-Pro.local): 2013-02-04T20:25:21, stalled for ###s at: BasicDistributedZkTest.testDistribSearch I really won't have any more time to devote to this patch this week, so if anybody feels ambitious please feel free. I'd be happy to provide stack traces, logs, or test things, but I've got a bunch of other stuff stacked up. SolrCloud comes up fine when following the Wiki FWIW.
          Hide
          Mark Miller added a comment -

          I'll try some test runs with the latest patch soon.

          Show
          Mark Miller added a comment - I'll try some test runs with the latest patch soon.
          Hide
          Erick Erickson added a comment -

          I'm still having the same problem with updates as of this morning. I spent some time trying to make sense of the stack traces and output, but no luck yet. I hacked in some code to in sure that all of the locking in CoreContainer is paired with a release, and they apparently are. I don't see any suspicious blocking in the stack traces.

          I did uncomment a zk state dump in ElectionContext.waitForReplicasToComeUp and see a bunch of statements that indicate that multiunload* has failed to recover. There's also the message:

          [junit4:junit4] 2> 148785 T324 oasc.SolrException.log SEVERE org.apache.solr.common.SolrException: core not found:multiunload1
          [junit4:junit4] 2> at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:905)
          [junit4:junit4] 2> at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:190)

          Whether this is all related, or whether it's even germane...well...I'm getting lost in the data.

          Curiously, we don't see to be stuck in the loop in ElectionContext.waitForReplicasToComeUp. In my entire log output, there are only 3 statements like:
          ShardLeaderElectionContext.waitForReplicasToComeUp Waiting until we see more replicas up: total=2 found=1 timeoutin=179999
          and the timeoutin number never does the countdown it used to do.

          I added a log statement in RcoveryStrategy, line 451 (retries = INTERRUPTED and that clause is definitely getting hit.

          Anyway, enough for a Sunday afternoon, I'll probably look at this a bit more tonight...

          Show
          Erick Erickson added a comment - I'm still having the same problem with updates as of this morning. I spent some time trying to make sense of the stack traces and output, but no luck yet. I hacked in some code to in sure that all of the locking in CoreContainer is paired with a release, and they apparently are. I don't see any suspicious blocking in the stack traces. I did uncomment a zk state dump in ElectionContext.waitForReplicasToComeUp and see a bunch of statements that indicate that multiunload* has failed to recover. There's also the message: [junit4:junit4] 2> 148785 T324 oasc.SolrException.log SEVERE org.apache.solr.common.SolrException: core not found:multiunload1 [junit4:junit4] 2> at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:905) [junit4:junit4] 2> at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:190) Whether this is all related, or whether it's even germane...well...I'm getting lost in the data. Curiously, we don't see to be stuck in the loop in ElectionContext.waitForReplicasToComeUp. In my entire log output, there are only 3 statements like: ShardLeaderElectionContext.waitForReplicasToComeUp Waiting until we see more replicas up: total=2 found=1 timeoutin=179999 and the timeoutin number never does the countdown it used to do. I added a log statement in RcoveryStrategy, line 451 (retries = INTERRUPTED and that clause is definitely getting hit. Anyway, enough for a Sunday afternoon, I'll probably look at this a bit more tonight...
          Hide
          Erick Erickson added a comment -

          OK, I finally figured out what was up with the failing BasicDistributedZkTest. Unfortunately, it was entirely due to the code changes in the rest of this patch, big surprise that <G>. Persisting solr.xml is a pain.

          I doubt it had anything to do with the problem that seems to come up intermittently w/ that test but w/o this patch. Rats, I was hoping for a twofer.

          At any rate, after I run precommit now and hammer this patch with the stress test program, I intend to check this in to trunk sometime this week, let it bake for a while, and then merge it in to 4x.

          Next up is probably sharing schemas, maybe adding the stress test to the junit tests.

          Show
          Erick Erickson added a comment - OK, I finally figured out what was up with the failing BasicDistributedZkTest. Unfortunately, it was entirely due to the code changes in the rest of this patch, big surprise that <G>. Persisting solr.xml is a pain. I doubt it had anything to do with the problem that seems to come up intermittently w/ that test but w/o this patch. Rats, I was hoping for a twofer. At any rate, after I run precommit now and hammer this patch with the stress test program, I intend to check this in to trunk sometime this week, let it bake for a while, and then merge it in to 4x. Next up is probably sharing schemas, maybe adding the stress test to the junit tests.
          Hide
          Mark Miller added a comment -

          with the problem that seems to come up intermittently

          Are you seeing this on your machine? If so, please file or add to a JIRA issue for that test with the details.

          Show
          Mark Miller added a comment - with the problem that seems to come up intermittently Are you seeing this on your machine? If so, please file or add to a JIRA issue for that test with the details.
          Hide
          Erick Erickson added a comment -

          bq: Are you seeing this on your machine? If so, please file or add to a JIRA issue for that test with the details.

          Nope, just going from the emails to the dev list from Apache. Or maybe I'm remembering from some time ago....

          Show
          Erick Erickson added a comment - bq: Are you seeing this on your machine? If so, please file or add to a JIRA issue for that test with the details. Nope, just going from the emails to the dev list from Apache. Or maybe I'm remembering from some time ago....
          Hide
          Mark Miller added a comment -

          Unless you see something fail locally without your changes, I wouldn't assume a test fail is unrelated. The Apache Jenkins fails on freebsd with blackhole are a whole different ballgame.

          Show
          Mark Miller added a comment - Unless you see something fail locally without your changes, I wouldn't assume a test fail is unrelated. The Apache Jenkins fails on freebsd with blackhole are a whole different ballgame.
          Hide
          Erick Erickson added a comment -

          I'm not communicating. There aren't any test failures on my machine so far. I was just hoping, vainly, that somehow re-arranging the CoreContainer code would expose whatever's been happening on Jenkins based solely on similarity of error messages. Turns out, as you say, that probably isn't the case.

          Tests with the changes in the latest patch are running just fine, at least once. I'll give them another couple of whirls after I've run through a few hours of the stress tests.

          Show
          Erick Erickson added a comment - I'm not communicating. There aren't any test failures on my machine so far. I was just hoping, vainly, that somehow re-arranging the CoreContainer code would expose whatever's been happening on Jenkins based solely on similarity of error messages. Turns out, as you say, that probably isn't the case. Tests with the changes in the latest patch are running just fine, at least once. I'll give them another couple of whirls after I've run through a few hours of the stress tests.
          Hide
          Mark Miller added a comment -

          I'm just letting you know for the future - I know you have brought up that test a couple times, and I'd be very interested in the fail if it was something you are seeing locally.

          But if you are just going by what apache jenkins is reporting, that is a different ball of wax, and I have all the info I need.

          So you should be able to trust those tests locally - and if you can't, that is something we should fix.

          Show
          Mark Miller added a comment - I'm just letting you know for the future - I know you have brought up that test a couple times, and I'd be very interested in the fail if it was something you are seeing locally. But if you are just going by what apache jenkins is reporting, that is a different ball of wax, and I have all the info I need. So you should be able to trust those tests locally - and if you can't, that is something we should fix.
          Hide
          Erick Erickson added a comment -

          got it, thanks.

          Show
          Erick Erickson added a comment - got it, thanks.
          Hide
          Erick Erickson added a comment -

          Also attached is latest jar of the stress test program as well as sources.

          Show
          Erick Erickson added a comment - Also attached is latest jar of the stress test program as well as sources.
          Hide
          Erick Erickson added a comment -

          Oops, fat fingers, sent too soon. Jar file is too big (zips up SolJ stuff), so src attached in zip file

          Show
          Erick Erickson added a comment - Oops, fat fingers, sent too soon. Jar file is too big (zips up SolJ stuff), so src attached in zip file
          Hide
          Erick Erickson added a comment -

          Final version. I plan to commit this today or tomorrow, let it bake for a bit and merge into 4x unless there are objections. The fix for SOLR-4505 took care of the deadlocks I was seeing in the tests...

          Show
          Erick Erickson added a comment - Final version. I plan to commit this today or tomorrow, let it bake for a bit and merge into 4x unless there are objections. The fix for SOLR-4505 took care of the deadlocks I was seeing in the tests...
          Hide
          Erick Erickson added a comment -

          Checked in, trunk r: 1451797

          I want this to bake for a few days, then I'll merge it into 4x. Probably early next week unless things blow up.

          Show
          Erick Erickson added a comment - Checked in, trunk r: 1451797 I want this to bake for a few days, then I'll merge it into 4x. Probably early next week unless things blow up.
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Erick Erickson
          http://svn.apache.org/viewvc?view=revision&revision=1451797

          SOLR-4196, steps toward making solr.xml obsolete

          Show
          Commit Tag Bot added a comment - [trunk commit] Erick Erickson http://svn.apache.org/viewvc?view=revision&revision=1451797 SOLR-4196 , steps toward making solr.xml obsolete
          Hide
          Erick Erickson added a comment -

          Merged into 4x, r: 1454477

          Show
          Erick Erickson added a comment - Merged into 4x, r: 1454477
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Erick Erickson
          http://svn.apache.org/viewvc?view=revision&revision=1454477

          Merging code for SOLR-4196, SOLR-4401, SOLR-4525. All about obsoleting solr.xml and supporting a large number of cores

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Erick Erickson http://svn.apache.org/viewvc?view=revision&revision=1454477 Merging code for SOLR-4196 , SOLR-4401 , SOLR-4525 . All about obsoleting solr.xml and supporting a large number of cores
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Erick Erickson
          http://svn.apache.org/viewvc?view=revision&revision=1456938

          Added comments for deprecating solr.xml (SOLR-4196 etc)

          Show
          Commit Tag Bot added a comment - [trunk commit] Erick Erickson http://svn.apache.org/viewvc?view=revision&revision=1456938 Added comments for deprecating solr.xml ( SOLR-4196 etc)
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Erick Erickson
          http://svn.apache.org/viewvc?view=revision&revision=1456941

          Added comments for deprecating solr.xml (SOLR-4196 etc)

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Erick Erickson http://svn.apache.org/viewvc?view=revision&revision=1456941 Added comments for deprecating solr.xml ( SOLR-4196 etc)
          Hide
          Mark Miller added a comment -

          I think there are some issues to look at here - there are a variety of non thread safe accesses to shared variables. That's not really a valid way to avoid deadlocks.

          Show
          Mark Miller added a comment - I think there are some issues to look at here - there are a variety of non thread safe accesses to shared variables. That's not really a valid way to avoid deadlocks.
          Hide
          Mark Miller added a comment -

          I'll try and be more specific over the weekend when I can look closer.

          Show
          Mark Miller added a comment - I'll try and be more specific over the weekend when I can look closer.
          Hide
          Erick Erickson added a comment -

          bq: ...there are a variety of non thread safe accesses to shared variables. That's not really a valid way to avoid deadlocks.

          Hmmm, where? I certainly didn't do this intentionally... but there's a lot of code here. Note that quite a bit of CoreContainer historically assumed that it was the only thread that was active, so some of this may be leftovers. But certainly the more eyes that look at this the better.

          Show
          Erick Erickson added a comment - bq: ...there are a variety of non thread safe accesses to shared variables. That's not really a valid way to avoid deadlocks. Hmmm, where? I certainly didn't do this intentionally... but there's a lot of code here. Note that quite a bit of CoreContainer historically assumed that it was the only thread that was active, so some of this may be leftovers. But certainly the more eyes that look at this the better.
          Hide
          Mark Miller added a comment -

          I'm going to do a full review soon and I'll report back.

          Show
          Mark Miller added a comment - I'm going to do a full review soon and I'll report back.
          Hide
          Mark Miller added a comment -

          what im talking about is this:

            // We are shutting down. We don't want to risk deadlock, so do this manipulation the expensive way. Note, I've
            // already deadlocked with closing/opening cores while keeping locks here....
            protected void clearMaps(ConfigSolr cfg) {
          

          And then below it clears cores and transient cores (both vars shared across threads) without a lock. This is the type of thing that will need to be fixed if we are to get rid of those fails in the stress test is my guess.

          Show
          Mark Miller added a comment - what im talking about is this: // We are shutting down. We don't want to risk deadlock, so do this manipulation the expensive way. Note, I've // already deadlocked with closing/opening cores while keeping locks here.... protected void clearMaps(ConfigSolr cfg) { And then below it clears cores and transient cores (both vars shared across threads) without a lock. This is the type of thing that will need to be fixed if we are to get rid of those fails in the stress test is my guess.
          Hide
          Erick Erickson added a comment -

          Good catch!

          Actually, I think clearing the maps is wrong too, as the cores are closed the individual entry should be removed from the relevant map. I've put that change into the code I'm working on now.....

          Show
          Erick Erickson added a comment - Good catch! Actually, I think clearing the maps is wrong too, as the cores are closed the individual entry should be removed from the relevant map. I've put that change into the code I'm working on now.....
          Hide
          Erick Erickson added a comment -

          Changed my mind, rather than fold it into SOLR-4615, I've already got a "stabilize test" JIRA, SOLR-4549 that I can work on independently. I'll put this fix in 4549 which might accumulate a few more just as soon as I find the time... Unlike certain young people I need some sleep occasionally... Siiigggghhh.

          Show
          Erick Erickson added a comment - Changed my mind, rather than fold it into SOLR-4615 , I've already got a "stabilize test" JIRA, SOLR-4549 that I can work on independently. I'll put this fix in 4549 which might accumulate a few more just as soon as I find the time... Unlike certain young people I need some sleep occasionally... Siiigggghhh.
          Hide
          Erick Erickson added a comment -

          Folding any further work into SOLR-4662 and/or SOLR-4549.

          Show
          Erick Erickson added a comment - Folding any further work into SOLR-4662 and/or SOLR-4549 .
          Hide
          Uwe Schindler added a comment -

          Closed after release.

          Show
          Uwe Schindler added a comment - Closed after release.

            People

            • Assignee:
              Erick Erickson
              Reporter:
              Erick Erickson
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development