
SOLR-1028: Automatic core loading/unloading for multicore

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0, 5.0
    • Fix Version/s: 4.1, 5.0
    • Component/s: multicore
    • Labels:
      None

      Description

      Use case: I have many small cores (say, one per user) on a single Solr box. Not all of the cores are needed all the time, but when one is needed I should be able to issue a search request directly; the core must be started automatically and the request served.

      This also requires that I must be able to set an upper limit on the number of cores loaded at any given point in time. If the limit is crossed, the CoreContainer must unload a core (preferably the least recently used core).

      There must also be a way to specify some cores as fixed; these cores must never be unloaded.
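      Not from the issue or any attached patch - a rough sketch of the requested behavior with invented names, using an access-ordered map for LRU bookkeeping and a pinned set for the "fixed" cores:

import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch only: an access-ordered map gives LRU behavior; "fixed" cores are pinned
// and skipped during eviction, so they are never unloaded.
class LruCoreRegistrySketch<C extends AutoCloseable> {
  private final int maxLoaded;                                   // upper limit on loaded cores
  private final Set<String> pinned = new HashSet<>();            // "fixed" cores, never unloaded
  private final Map<String, C> loaded = new LinkedHashMap<>(16, 0.75f, true); // true = LRU order

  LruCoreRegistrySketch(int maxLoaded) { this.maxLoaded = maxLoaded; }

  synchronized void pin(String coreName) { pinned.add(coreName); }

  // Returns the core, starting it on demand if it is not currently loaded.
  synchronized C get(String coreName, Function<String, C> loader) throws Exception {
    C core = loaded.get(coreName);                               // a hit also refreshes LRU order
    if (core == null) {
      core = loader.apply(coreName);                             // "start" the core automatically
      loaded.put(coreName, core);
      evictIfOverLimit();
    }
    return core;
  }

  private void evictIfOverLimit() throws Exception {
    Iterator<Map.Entry<String, C>> it = loaded.entrySet().iterator(); // least recently used first
    while (loaded.size() > maxLoaded && it.hasNext()) {
      Map.Entry<String, C> eldest = it.next();
      if (pinned.contains(eldest.getKey())) continue;            // pinned cores may keep us over the cap
      eldest.getValue().close();                                 // unload the least recently used core
      it.remove();
    }
  }
}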

      Attachments

      1. jenkins.jpg
        59 kB
        Dawid Weiss
      2. SOLR-1028_testnoise.patch
        2 kB
        Robert Muir
      3. SOLR-1028.patch
        50 kB
        Erick Erickson
      4. SOLR-1028.patch
        51 kB
        Erick Erickson

        Issue Links

          Activity

          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Selection criteria were "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. Email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Robert Muir added a comment -

          Bulk move 3.2 -> 3.3

          Robert Muir added a comment -

          3.4 -> 3.5

          Hoss Man added a comment -

          Bulk of fixVersion=3.6 -> fixVersion=4.0 for issues that have no assignee and have not been updated recently.

          email notifications suppressed to prevent mass spam
          pseudo-unique token identifying these issues: hoss20120321nofix36

          Erick Erickson added a comment -

          Patch attached. This will subsume SOLR-880; keeping them separate was more trouble than it was worth.

          I'd like to commit this to trunk real soon now and let it bake a while before merging into 4.x. At least let it bake until after 4.0.1 or 4.1 (whatever) is cut.

          Erick Erickson added a comment -

          Here are the comments I meant to include.

          1> implements loadOnStartup (default true) as a <core...> attribute. Causes the core to be loaded when Solr starts, just like it does now.
          2> implements swappable (default false) as a <core...> attribute. Allows Solr to automatically unload the core if it runs out of room in a (new) LRU core list.
          3> implements a swappableCacheSize attribute for <cores...> (default unlimited) for how large the list of "swappable" cores can grow to before they're closed and swapped out.
          4> Wraps a bunch of exceptions in CoreContainer into a SolrException (as per comments in that code).
          5> persists these two new <core> attributes.
          6> persists the swappableCacheSize attribute of <cores> if set.

          All the tests run, so it'll be interesting to see what the build machine thinks. In all its forms.

          The cost of this functionality in the usual case (i.e. for current users) is, I believe, only a single check in CoreContainer.getCore when the core name isn't already found in the "usual" list.
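          As a purely hypothetical illustration of that lookup order (invented names, not the actual CoreContainer code), the idea is that the common path stays exactly as it is today and only a miss falls through to the lazily-loadable ("swappable") descriptors:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch only: the fast path is untouched; a miss costs one extra map lookup.
class CoreLookupSketch<C> {
  private final Map<String, C> loadedCores = new ConcurrentHashMap<>();          // the "usual" list
  private final Map<String, String> lazyDescriptors = new ConcurrentHashMap<>(); // name -> instanceDir
  private final Function<String, C> lazyLoader;                                  // opens a core on demand

  CoreLookupSketch(Function<String, C> lazyLoader) { this.lazyLoader = lazyLoader; }

  C getCore(String name) {
    C core = loadedCores.get(name);
    if (core != null) {
      return core;                                   // fast path: unchanged cost for current users
    }
    String instanceDir = lazyDescriptors.get(name);  // the single extra check on a miss
    if (instanceDir == null) {
      return null;                                   // truly unknown core, same behavior as before
    }
    C loaded = lazyLoader.apply(instanceDir);        // load lazily
    loadedCores.put(name, loaded);                   // a real version would register it in the LRU list
    return loaded;
  }
}

          Note that this naive version has exactly the concurrent double-load problem Yonik raises near the end of this thread; the pending-set sketch there addresses it.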

          Mark Miller added a comment -

          Kind of an aside to this specific issue, but in relation to this massive number of SolrCore issues: I'm not sure what I think about all this multi-core work if it's not going to jive with SolrCloud. Trying to develop Solr as two different systems is too costly in the long term IMO. My feeling has been that we would push away from the std mode and always run in SolrCloud mode eventually - it's just a matter of waiting until it's ready. And so introducing a lot of new work and code that is not compatible with SolrCloud makes me think.

          A lot of the ugly warts that are left now are around because we are straddling two different ways of doing things. The path that makes it all much nicer involves narrowing down to one way of doing things I think.

          Not quite the right place for the discussion here, but probably one that we should have.

          Erick Erickson added a comment -

          Yeah, I've punted entirely on the ZooKeeper case so far, and I've been wondering about how the two will come together. That said, I suspect SolrCloud will make use of most of this infrastructure when they do come together; the use case is how to provide a way to automatically load/unload cores when you have a bunch of them resident but only want a subset "live" at any one time. The use case I'm working on here is on the order of 10,000 cores with, maybe, 100-200 active at any one time. My really fuzzy conception is that if we can get SolrCloud to take the place of the "CoreDescriptorProvider" mechanism in this kind of case, it'll "just work". How all that plays with replicas/leaders, the ZK election process, and all of that is...er...not clear to me...

          FWIW, I hacked the code as an experiment so it's used in all the SolrCloud tests (without the hack, anything that's ZK-sensitive doesn't use this code path) and was pleasantly surprised by how few of them failed, so I think reconciling the two ways of doing things won't be horrible. Always room for surprises, of course.

          As you said, though, I suspect we'll be hashing this all out in the "how to bring this all together" discussion, wherever we have it.

          Noble Paul added a comment -

          I agree with Mark on this. We should not make a feature completely incompatible with SolrCloud. The ugly thing I see in SolrCloud is the single huge file stored in ZK; it's not suitable for tens of thousands of cores. We will easily run into multiple megabytes.

          Erick Erickson added a comment -

          Committed trunk, r: 1403710

          I want to let this bake a while before merging into 4.x, until 4.1/4.0.1 gets cut at least.

          Hoss Man added a comment -

          Erick: note the following recent jenkins failure...

          java.lang.AssertionError: reload exception doesn't refer to prolog 프롤로그에서는 콘텐츠가 허용되지 않습니다.
                  at __randomizedtesting.SeedInfo.seed([F661F6E616E4E41:959CD4968591C045]:0)
                  at org.junit.Assert.fail(Assert.java:93)
                  at org.junit.Assert.assertTrue(Assert.java:43)
                  at org.apache.solr.core.CoreContainerCoreInitFailuresTest.testFlowBadFromStart(CoreContainerCoreInitFailuresTest.java:254)
          
          ant test  -Dtestcase=CoreContainerCoreInitFailuresTest -Dtests.method=testFlowBadFromStart -Dtests.seed=F661F6E616E4E41 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=ko -Dtests.timezone=GB -Dtests.file.encoding=UTF-8
          

          The problem seems to be that when you cleaned up which exceptions get thrown, you also changed the logic the test uses to verify that the expected error was generated, so that it relies on a specific (English) string "prolog". When we run in some locales that specific string is not included in the wrapped XML exceptions - hence the previous usage of e.getSystemId().indexOf("solrconfig.xml") in that test, since that is a guaranteed property of the exception that is in our control.
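          A sketch of that locale-safe check (illustrative names only, not the committed r1403936 fix): assert on SAXParseException.getSystemId(), which we control, rather than on the parser's localized message text:

import static org.junit.Assert.assertTrue;
import org.xml.sax.SAXParseException;

// Sketch: the system id of the parse failure always names the file we fed the parser,
// regardless of the JVM's locale.
class LocaleSafeAssertSketch {
  static void assertFailureRefersToSolrconfig(Exception reloadException) {
    Throwable cause = reloadException.getCause();
    assertTrue("expected a SAXParseException cause, got: " + cause, cause instanceof SAXParseException);
    String systemId = ((SAXParseException) cause).getSystemId();
    assertTrue("exception should point at solrconfig.xml, got: " + systemId,
        systemId != null && systemId.contains("solrconfig.xml"));
  }
}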

          Erick Erickson added a comment -

          Hoss:

          Thanks for pointing me in the right direction here. How I made that test pass kinda sucked....

          As per our chat, this apparently is a Java 7 thing, which is why I can't seem to reproduce it.

          Anyway, checked in a fix r: 1403936 that I'll back-port along with this issue.

          Mark Miller added a comment -

          The ugly thing I see in SolrCloud is the single huge file stored in zk . It's not suitable for tens of thousands of cores. We will easily run into multiple megabytes

          Yeah, this is my main concern as well - though it remains to be seen if a few megabytes is an issue on a gigabit or better network.

          We actually did things differently in the past, but that came with its own issues.

          Mark Miller added a comment -

          As you said, though, I suspect we'll be hashing this all out in the "how to bring this all together" discussion, wherever we have it.

          I guess my point is that we should hash things out first. If you are going in a non-SolrCloud-compatible direction with all of these massive core issues, I think that is a mistake. You are adding a bunch of code and complication that will just make it harder to pull away from the old-style Solr and into a cleaner new world. You may be creating features we will have to continue supporting, or rip away from users shortly after they are added. Neither of those may be a good idea.

          Developing along the old Solr model and just hoping everything will converge in the end seems like the wrong approach.

          Erick Erickson added a comment -

          I think the main thrust of these changes is compatible with SolrCloud, but the more eyes the merrier. Here's my reasoning:

          > the idea of rapidly opening/closing cores, and limiting the number of concurrently-open cores will have to be handled locally, which is what the bulk of these changes actually are. One bit of noise here will be that I refactored the ZK-case to the local-case, so it may look worse than it is.

          > After I looked a bit more at the ZK instance, I noticed that the SolrCloud stuff also has a new coreDescriptorProvider (I'll reconcile those two as part of SOLR-1306, BTW. There's no reason to have two). So whether the descriptor comes from ZK or comes from "some other place" should (tm) be transparent on the level of these changes.

          > I think the biggest question for me about how ZK interacts with all this is mostly how opening/closing cores is supposed to work during indexing. The whole notion of distributed indexing across a zillion rapidly opening/closing cores on a single machine really seems like something that shouldn't be happening at all - or at least it's a way for users to shoot themselves in the foot. Imagine that you have 10K cores/machine, each with 3 replicas, and you're randomly sending updates to those cores. Further imagine that your concurrently-open core limit is 100. Throughput would be horrible. I suppose the right solution is that whoever is setting this up (and I assume they're pretty sophisticated) needs to index to a single core at a time until all the updates are sent, then go on to the next core. Or pay the price speed-wise.

          > The other bit I'm not clear about on the ZK end is how we keep, say, 10K coreDescriptors in ZK with the 1M limit that has been mentioned. But again, I don't think that is incompatible at all with these changes.

          > I don't think all the JIRAs associated with SOLR-1293 need to be addressed. Some of them appear to be already done or have yet to be proved helpful. But since they're all local to the Solr instance anyway, I suspect they'll be the same whether in SolrCloud or not.

          > If we go to a model where ZK runs transparently even in the "normal" case, then as long as the CoreDescriptorProvider is pluggable in that situation, I think we're good to go.

          All that said, it would be a Good Thing if anyone can poke holes in my hand-waving before I back myself into a corner. Note that if anyone looks at this, they should look at SOLR-1306 in conjunction with this JIRA. Between the two of them the bulk of the changes I'm thinking about are handled.
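          The comment above leans on a pluggable "CoreDescriptorProvider"; as a purely hypothetical illustration (invented interface and names, not Solr's actual API), such a provider might look like the following, so that ZK, solr.xml, or "some other place" could all back the same lookup:

import java.util.Collection;

// Sketch only: the container asks the provider about cores it has not loaded yet.
interface CoreDescriptorProviderSketch {
  /** Names of cores this provider knows about but the container has not loaded. */
  Collection<String> getCoreNames();

  /** Enough information to open the core on demand, or null if the name is unknown. */
  CoreDescriptorInfo getDescriptor(String coreName);

  /** Simple value holder standing in for Solr's CoreDescriptor. */
  final class CoreDescriptorInfo {
    public final String name;
    public final String instanceDir;
    public CoreDescriptorInfo(String name, String instanceDir) {
      this.name = name;
      this.instanceDir = instanceDir;
    }
  }
}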

          Yonik Seeley added a comment -

          TestLazyCores seems to have failed a number of times since it was added here.
          http://svn.apache.org/viewvc?rev=1403710&view=rev
          Anyone have any clues?

          Erick Erickson added a comment -

          Right, I haven't been able to devote any serious time to this this week, but I should get a chance to work on it tomorrow. It might be a race condition. The last time I tried to reproduce it, I couldn't.....

          Robert Muir added a comment -

          I can't reproduce these failures, even with master seed. The problem seems to be testCachingLimit.

          The rest of it is noise, because the try/finally in testCachingLimit doesn't close the cores it opens.
          This causes the endTrackingSearchers() to fire additional class-level failures which just add to noise.

          Is there any way we can take the endTrackingSearchers/zkClients checks and instead implement them as a test rule,
          plugged into the class rules chain, so they won't do this if the test already failed?

          Mark Miller added a comment -

          This causes the endTrackingSearchers() to fire additional class-level failures which just add to noise.

          Yeah, that is always really annoying. You almost always have to dig for the true failure because so many failures can keep cores from being closed. I suppose many of those cases can be fixed one by one, but it would be awesome to get the real error and not always bury it under endTrackingSearchers failures.

          Robert Muir added a comment -

          Right, I don't think tests should have to have such tricky try/finally stuff.

          Back in the day when all these checks were piled into LuceneTestCase, there was a boolean 'testsFailed' which these things checked to avoid the noise.

          But now it's cleaner with these test rule chains; I just don't know how to factor out these trackers into test rules, placed in the chain in such a way that they work like the other checks and don't fire if the test already failed.

          Maybe Dawid Weiss knows?

          Robert Muir added a comment -

          No idea if this patch actually works to help with that noise situation.

          I tried doing it only in afterClass, but this caused a lot of problems; I think maybe some tests depend on the sleeps done here as a side effect?

          Mark Miller added a comment -

          I think maybe some tests depend on the sleeps done here as a side-effect?

          Ugh... that would be a bummer, but it's certainly possible.

          Dawid Weiss added a comment -

          It all depends where things went wrong. If it's a test that failed and you're at the after-test or after-suite level then I think it'd be possible to tell that a previous failure has occurred. In fact, it's currently being done in the code – TestRuleMarkFailure.java contains code to intercept the "status" of the wrapped code. You need to be careful about a few things (assumptions are propagated as exceptions, for example) but otherwise it's pretty straightforward. One problem with the current use of this rule is that it is always called last to detect failures from other rules. It's a chicken-and-egg problem that doesn't have clean solutions... or so I tend to think – I'd need to take a look at the runner again and think about whether this flag could be part of the randomized context.

          The problem in general is that you could create a test rule that intercepts exceptions (or failures) from tests and cancels them out (simply by eating up the thrown exception). So it's not just the fact of throwing an exception from a test that matters – it's crossing the boundary from the suite class back to the runner. If this sounds complex and twisted to you it's because it kind of is...

          You almost always have to dig for the true failure because so many failures can keep cores from being closed

          One thing that should help here is that the exceptions are always reported in the order in which they were captured and chained. So the first exception in the failure report is the one that happened first... I realize it's still a lot of digging, sorry.
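          A generic JUnit 4 sketch of the pattern being discussed (invented class names; this is not Lucene's TestRuleMarkFailure): one rule records failures, another runs its noisy check only when nothing failed:

import java.util.concurrent.atomic.AtomicBoolean;
import org.junit.rules.TestRule;
import org.junit.runner.Description;
import org.junit.runners.model.Statement;

// Inner rule: remembers whether the wrapped statement (the test) threw.
class MarkFailureRuleSketch implements TestRule {
  final AtomicBoolean failed = new AtomicBoolean(false);

  @Override
  public Statement apply(final Statement base, Description description) {
    return new Statement() {
      @Override
      public void evaluate() throws Throwable {
        try {
          base.evaluate();
        } catch (Throwable t) {
          failed.set(true);          // remember the failure for rules further out in the chain
          throw t;
        }
      }
    };
  }
}

// Outer rule: skips its own follow-up check when the test already failed.
class SkipOnFailureCheckSketch implements TestRule {
  private final MarkFailureRuleSketch failures;
  SkipOnFailureCheckSketch(MarkFailureRuleSketch failures) { this.failures = failures; }

  @Override
  public Statement apply(final Statement base, Description description) {
    return new Statement() {
      @Override
      public void evaluate() throws Throwable {
        try {
          base.evaluate();
        } finally {
          if (!failures.failed.get()) {
            // only here would an endTrackingSearchers-style leak check run and assert
          }
        }
      }
    };
  }
}

          With a RuleChain the marker sits inside the checker, e.g. RuleChain.outerRule(new SkipOnFailureCheckSketch(mark)).around(mark), so the flag is already set by the time the outer finally runs.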

          Mark Miller added a comment -

          Looking at the logs of the latest failure, I found a bit (this is one of only 2 or 3 types of failures I've seen, though):

          It appears that it's trying to create the index for a new collection under the existing collection1's data dir - I don't know if this is due to a race or what, but it's obviously not supposed to be happening...

          oasc.CoreContainer.recordAndThrow SEVERE Unable to create core: collectionLazy5 org.apache.solr.common.SolrException: Cannot create

          Caused by: java.io.IOException: Cannot create directory: C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\solr\build\solr-core\test-files\solr\collection1\data\index
          [junit4:junit4] 2> at org.apache.lucene.store.NativeFSLock.obtain(NativeFSLockFactory.java:171)

          Mark Miller added a comment -

          FYI, I've seen this on 3 machines now - my fullmetal jenkins (Linux), policeman jenkins (Windows), and just recently my 11-inch low-powered MacBook Pro.

          Mark Miller added a comment -
          [junit4:junit4]   2> 2265 T632 oasc.SolrCore.initIndex WARNING [collection1] Solr index directory 'C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\solr\build\solr-core\test-files\solr\collection1\data\index' doesn't exist. Creating new index...
          [junit4:junit4]   2> 2265 T633 oasc.SolrCore.initIndex WARNING [collectionLazy5] Solr index directory 'C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\solr\build\solr-core\test-files\solr\collection1\data\index' doesn't exist. Creating new index...
          

          So this may be the race - if collection1 creates its index dir fast enough, collectionLazy5 probably does not blow up - but it's probably still borked and pointing to collection1's instance dir - I'm guessing the test is just not catching this problem.

          Erick Erickson added a comment -

          This may fix the test. Separate issue to track merging into 4.x.

          Erick Erickson added a comment -

          Gaaaah! Thanks a million for this detail. I think I know what's going on. SOLR-4149 is an attempt to fix this. Can someone apply 4149 and give it a try or should I just commit 4149?

          Mark Miller added a comment -

          It doesn't happen enough on my mac to easily reproduce - and my mac takes a long time for each test run. I'd just commit and see how it works out.

          Dawid Weiss added a comment -

          Every time I hear of fullmetal jenkins I have this in front of my eyes.

          Erick Erickson added a comment -

          When merging into 4.x, also include r: 1417907 for SOLR-4149

          Dawid: What did I see somewhere? Something about "that which has been seen cannot be unseen" Iron man meets the butler....

          Mark Miller added a comment -

          Ha - I've got to hack that image into my jenkins instance.

          Dawid Weiss added a comment -

          I think jenkins itself should detect whether it's running under a hypervisor or on bare metal and adjust accordingly... I'd file a JIRA feature request with them, LOL

          Erick Erickson added a comment -

          Has anyone with a machine that testLazyCores used to fail on seen failures since I committed my attempt at a fix? I committed SOLR-4149 last week, don't quite know if it's had enough time to really say it's fixed, unless we're seeing more failures....

          FWIW, if it's what I think it was, it was a test artifact rather than the underlying code. I can hope anyway.

          Commit Tag Bot added a comment -

          [branch_4x commit] Yonik Seeley http://svn.apache.org/viewvc?view=revision&revision=1420992 SOLR-2592: SOLR-1028: SOLR-3922: SOLR-3911: sync trunk with 4x
          Shawn Heisey added a comment -

          The new swappable default of false needs a prominent mention in the .txt files in sections about upgrading from previous versions where it didn't exist. While working on some other issues on branch_4x, I was nearly bitten by this. Just before I was about to start a full rebuild, I happened to glance at my solr.xml file and noticed the new default had been added. If I hadn't done that, I would have gone through hours of rebuild only to have the core swap at the end fail.

          If swappable is false and you do a SolrJ CoreAdmin#swap, will it throw an exception? I believe it should, for error handling in SolrJ code. I will eventually test this myself, but I won't be able to get to it right now.

          Shawn Heisey added a comment -

          Further thought – I believe that the solr.xml file was rewritten when I did a swap between seven pairs of cores sequentially on my last rebuild yesterday. If the first swap rewrote the solr.xml with swappable=false, then I am wondering how the other six swaps were able to complete. Perhaps it's not really being enforced?

          Erick Erickson added a comment -

          I picked a really bad name there; I'll change it. Swappable has nothing to do with swapping cores via core admin and everything to do with whether there's a limited number of cores that are loaded at once. "Swappable" here means that the system can automatically load/unload cores as needed, to handle the case where there are lots of cores and the installation can tolerate the pain of having cores load on demand and be slow for the first few queries.

          There should be no change in behavior as far as the core admin handler's "swap" command is concerned, no matter what the value of "swappable" in solr.xml is; it's actually an unrelated concept despite the name, but I can sure see why it would be confusing....

          Maybe "cacheable"?

          Sorry for the confusion.

          Shawn Heisey added a comment - (edited)

          Erick, thank you for the clarification. Mild panic unjustified.

          The mild panic however did uncover a possible problem for you to investigate. If loadOnStartup=true and swappable=true, none of the cores get loaded on startup. Once one is accessed, it does get loaded. This is on branch_4x revision 1421496.

          The wiki currently has the following about this combination:
          loadOnStartup=true swappable=true: There are some cores you want loaded when the server first starts up, but that you'll allow to be swapped out. It's wasteful to specify more cores like this than your swappableCacheSize value.
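          A hedged sketch of the startup behavior the wiki text describes (all names invented; the actual coding error is tracked in SOLR-4201, mentioned below): every core flagged loadOnStartup=true should be opened eagerly whether or not it is also swappable; the swappable flag should only decide whether it may be unloaded again later:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: eager loading is driven by loadOnStartup; swappable picks the target list.
class StartupSketch {
  static final class CoreCfg {
    final String name; final boolean loadOnStartup; final boolean swappable;
    CoreCfg(String name, boolean loadOnStartup, boolean swappable) {
      this.name = name; this.loadOnStartup = loadOnStartup; this.swappable = swappable;
    }
  }

  final Map<String, Object> permanentCores = new HashMap<>();  // never unloaded
  final Map<String, Object> transientCores = new HashMap<>();  // may be unloaded under pressure

  void loadAtStartup(List<CoreCfg> configs) {
    for (CoreCfg cc : configs) {
      if (!cc.loadOnStartup) {
        continue;                          // lazy cores are only registered, loaded on first use
      }
      Object core = openCore(cc.name);     // eager load, regardless of the swappable flag
      (cc.swappable ? transientCores : permanentCores).put(cc.name, core);
    }
  }

  private Object openCore(String name) {
    return new Object();                   // placeholder standing in for actually opening a SolrCore
  }
}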

          Erick Erickson added a comment -

          Gaaaah. See SOLR-4201. Really stupid coding error. If you're feeling really brave, change the code as per the second comment on that JIRA and give it a whirl.....

          I should have something up tomorrow.

          Steve Rowe added a comment -

          Erick, can this issue be resolved?

          Erick Erickson added a comment -

          Should have resolved this quite a while ago, thanks for prompting me Steve.

          Yonik Seeley added a comment -

          I started a really quick review of the code, but the first method I visited (CoreContainer.getCore()) seems to have a race condition?
          Trying to open the same core more than once concurrently seems like it could get pretty messy, and it doesn't seem like it's protected against. All one would have to do is send in more than one request targeted toward the same core around the same time to trigger it.

          Erick Erickson added a comment - (edited)

          You're right, thanks! I'm a bit reluctant to extend the synchronized block around the entire following try block just because loading the core can be quite a lengthy process.

          I have to go away until early evening, but I think I can do something about this tonight. Do you think the risk/reward ratio is OK for 4.1? This will only manifest itself with transient cores, which is really fully supported in 4.2, not 4.1 (yes, that is a cop-out). But deadlocks are so easy to introduce when in a rush....

          I've raised another JIRA, SOLR-4300, and called it a "blocker" for now; feel free to change that depending...

          This seems really trappy; I'm wondering if for 4.2 I should encapsulate this whole "lazy" and "transient" and "permanent" distinction into a class that handles all the messy details. It's too easy to miss one of these places in the current code. There are several other places in CoreContainer that synchronize on "cores" and it's not obvious whether all of them should worry about transient cores or not, e.g. renaming cores.

          But what I'm thinking of is a "pending" set of cores, and the code would look something like this (just above the "try"). Note this is just a sketch; I haven't checked it out much at all, just wondering if it makes sense.

          See the patch at SOLR-4300 for the actual code.
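          Erick's inline snippet is not preserved above; purely as an independent illustration (not his code, and not the SOLR-4300 patch), a "pending" set guarding against two threads loading the same transient core concurrently might look like this:

import java.util.HashSet;
import java.util.Set;

// Sketch only: a thread claims a core name before the slow load; other threads asking for the
// same core wait until the first load finishes instead of loading it a second time.
class PendingCoreLoads {
  private final Set<String> pendingCores = new HashSet<>();

  /** Claim the right to load 'name', waiting if another thread is already loading it. */
  void awaitAndClaim(String name) throws InterruptedException {
    synchronized (pendingCores) {
      while (pendingCores.contains(name)) {
        pendingCores.wait();               // someone else is loading this core; wait, don't reload
      }
      pendingCores.add(name);              // we are now responsible for loading it
    }
  }

  /** Release the claim once the core is loaded (or loading failed) and wake any waiters. */
  void release(String name) {
    synchronized (pendingCores) {
      pendingCores.remove(name);
      pendingCores.notifyAll();
    }
  }
}

          A caller would claim the name before the lengthy core load, re-check the already-loaded map after the claim, and release the name in a finally block so waiters are always woken.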


            People

            • Assignee: Erick Erickson
            • Reporter: Noble Paul
            • Votes: 8
            • Watchers: 15
