SOLR-1781

Replication index directories not always cleaned up

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 4.0-BETA, 5.0
    • Labels:
      None
    • Environment:

      Windows Server 2003 R2, Java 6b18

      Description

      We had the same problem as someone described in http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201001.mbox/%3c222A518D-DDF5-4FC8-A02A-74D4F232B339@snooth.com%3e. A partial copy of that message:

      We're using the new replication and it's working pretty well. There's
      one detail I'd like to get some more information about.

      As the replication works, it creates versions of the index in the data
      directory. Originally we had index/, but now there are dated versions
      such as index.20100127044500/, which are the replicated versions.

      Each copy is sized in the vicinity of 65G. With our current hard drive
      it's fine to have two around, but 3 gets a little dicey. Sometimes
      we're finding that the replication doesn't always clean up after
      itself. I would like to understand this better, or to not have this
      happen. It could be a configuration issue.

      1. SOLR-1781.patch
        3 kB
        Mark Miller
      2. SOLR-1781.patch
        10 kB
        Mark Miller
      3. 0001-Replication-does-not-always-clean-up-old-directories.patch
        2 kB
        Terje Sten Bjerkseth


          Activity

          Terje Sten Bjerkseth added a comment -

          This patch seems to fix the problem, mostly.

          Noble Paul added a comment -

The problem is that while you are trying to delete the original index, there may be requests in the pipeline which still use the old index. If the index files are deleted, those requests may fail.
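
The deferred-delete problem described here is essentially reference counting: the old directory can only be removed once the last in-flight request releases it (the approach Mark Miller later takes via ref-counted Directories). A minimal sketch of the idea; the RefCountedDir class and its methods are illustrative, not Solr's actual API:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: an old index directory is only deleted once the last
// request that references it has finished. RefCountedDir is an illustrative
// name, not a Solr class.
class RefCountedDir {
    private final AtomicInteger refs = new AtomicInteger(1); // creator's reference
    private volatile boolean deleteOnRelease = false;
    private volatile boolean deleted = false;
    final String path;

    RefCountedDir(String path) { this.path = path; }

    // A request that wants to search the old index takes a reference first.
    void incref() { refs.incrementAndGet(); }

    // Replication installed a new index: mark the old one obsolete and drop
    // the creator's reference. Deletion happens only when nobody uses it.
    void markForDeletion() {
        deleteOnRelease = true;
        release();
    }

    void release() {
        if (refs.decrementAndGet() == 0 && deleteOnRelease) {
            deleted = true; // real code would delTree(path) here
        }
    }

    boolean isDeleted() { return deleted; }
}
```

With this scheme, a request that took a reference before the new index was installed keeps the old directory alive until it calls release(), so in-flight searches never see their files disappear.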

          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Shawn Heisey added a comment -

          I have been having this problem too, but most of the time it's fine. After a discussion today on the mailing list, I believe it is related to a replication initiated by swapping cores on the master.

          Robert Muir added a comment -

          Bulk move 3.2 -> 3.3

          Robert Muir added a comment -

          3.4 -> 3.5

          Hoss Man added a comment -

          bulk fixing the version info for 4.0-ALPHA and 4.0 all affected issues have "hoss20120711-bulk-40-change" in comment

          Markus Jelsma added a comment -

This happens almost daily on our SolrCloud (trunk) test cluster; we sometimes see four surplus index directories created in a day.

          Mark Miller added a comment -

          Markus: on windows?

          Markus Jelsma added a comment -

No, we're not on Windows; I should have included that in my comment. All servers run Debian GNU/Linux 6.0, and the cloud test cluster always runs with a very recent build from trunk.

          Mark Miller added a comment -

          I'll look into taking care of this.

          Mark Miller added a comment -

I'm not sure who did it or when, but it looks like a form of the above patch has already been committed, except that instead of:

          +        else if (successfulInstall) {
          +          delTree(indexDir);
          +        }
          

          it's

          +        else {
          +          delTree(indexDir);
          +        }
          

          I think there is a problem with both. Anyway, I can duplicate this issue easy enough - I'll figure out the right fix.

          Mark Miller added a comment -

          Hmm...so I think I have a fix...but it involves reloading the core even if you do not replicate conf files.

          The problem is, unless you reload the core, it's very difficult to know when searchers are done with the old index.

          Markus Jelsma added a comment -

          Perhaps they could be cleaned up on core start or after some time has passed?

          Mark Miller added a comment -

          after some time has passed?

          This was tempting for a moment - but it feels dirty.

          You could just say, hey, do it on the next core restart - but that may not come for a long time.

          You'd really like to be able to tell when all the indexreaders on the old dir are done, but that is very difficult it turns out. The only easy way to do it is to reload the core.

          If you replicate configuration files, you have to reload the core anyway. I'd say it's probably the way to go. It should not be that much more expensive than reloading a searcher in the larger scheme of replication.

          Mark Miller added a comment -

          Patch that starts reloading on replication even when config is not replicated.

Also, so that existing tests will pass, the dataDir is now consistent across reloads - it won't pick up a new setting from config.
          I conferred about this latter change with hossman in IRC, so for now, that's the way I'm going.

Mark Miller added a comment - edited

Spoke to yonik about this real quick, and he brought up the ref counting we have for Directories as an alternative to using reload. It took a couple of new methods in our DirectoryFactory, but I think I have something working using that now.

          Mark Miller added a comment -

          Latest patch - I'll commit this soon.

          Mark Miller added a comment -

          Committed.

          Markus Jelsma added a comment -

Hi, is the core reloading still part of this? I get a lot of firstSearcher events on a test node now and it won't come online. Going back to a July 18th build (before this patch) works fine. Other nodes won't come online with a build from the 19th (after this patch).

          Mark Miller added a comment -

No, no reload. Can you please elaborate on what not coming online means? Can you share logs?

          Markus Jelsma added a comment -

The node never responds to HTTP requests, all ZK connections time out, and resource consumption is very high. I'll try to provide a log snippet soon. I tried running today's build several times, but one specific node refuses to come online. Another node did well and runs today's build.

          I cannot attach a file to a resolved issue. Send over mail?

          Mark Miller added a comment -

I'll reopen - email is fine as well.

          Markus Jelsma added a comment -

          Log sent.

This node has two shards on it and executed 2x 512 warmup queries, which adds up. It won't talk to ZK (see the tail of the log). Restarting the node with a build from the 18th works fine. Did it three times today.
          Thanks

          Mark Miller added a comment -

Hmm... I can't replicate this issue so far.

Another change around then was updating to ZooKeeper 3.3.5 (bug fix update).

I wouldn't expect that to be an issue - but are you just upgrading one node and not all of them?

          Markus Jelsma added a comment -

          Strange indeed. I can/could replicate it on one machine consistently and not on others. Machines weren't upgraded at the same time to prevent cluster downtime.

          I'll check back monday, there are two other machines left to upgrade plus the bad node.

          Mark Miller added a comment -

          Machines weren't upgraded at the same time to prevent cluster downtime.

Yeah, makes sense; I just wasn't sure how you went about it. I'd expect a bugfix release of ZooKeeper to work no problem with the previous nodes, but it's the other variable, I think. They recommend upgrading with rolling restarts, so it shouldn't be the problem...

          Markus Jelsma added a comment -

The problem is resolved. The `bad` node created several new index.[0-9] directories, even with this patch, and caused high I/O. I deleted the complete data directory, including the index.properties file. It then loaded its index from the other nodes and no longer created many index dirs.

          Thanks

          Markus Jelsma added a comment -

          One of the nodes ended up with two index directories today. Later some other nodes also didn't clean up after they got restarted.

          Mark Miller added a comment -

Odd - in the first case, are you sure the indexes were there over time? For brief periods, more than one can exist... they should just end up being cleaned up when no longer in use.

          I can try and dig in some more, but I'll have to think a little - I don't really know where to start.

My test for this issue runs a lot of instances and randomly starts and stops them. I then monitor the index directories for these 6-12 instances - I run these tests for a long time and verify that each keeps ending up with one index dir. At some points there are two indexes - but it always drops back to one shortly after.

          So there still may be some hole, but I don't know where or how.

If you look in the logs, perhaps you will see a bunch of "Unable to delete directory : " entries? It might be that it's trying but cannot. It might make sense to start using deleteOnExit as a last resort if delete fails - I just looked, and the del call in SnapPuller does not do this.
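
The deleteOnExit fallback suggested here could look roughly like the following; DirCleaner.delTree is a hypothetical stand-in for the del call in SnapPuller, not the actual implementation:

```java
import java.io.File;

// Sketch of a recursive delete with a last-resort fallback: if a file or
// directory cannot be deleted now (e.g. its files are still held open, as
// happens on Windows), register it for deletion when the JVM exits. Best
// effort only - deleteOnExit order is not guaranteed to empty a directory
// before its parent is attempted.
class DirCleaner {
    static boolean delTree(File dir) {
        File[] children = dir.listFiles(); // null if dir is a plain file
        if (children != null) {
            for (File child : children) {
                delTree(child);
            }
        }
        if (dir.delete()) {
            return true;
        }
        dir.deleteOnExit(); // last resort: try again at JVM shutdown
        return false;
    }
}
```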

          Markus Jelsma added a comment -

          Hi,

Both today and yesterday I deleted indexes that were more than a few hours old.

Some are indeed being removed and some persist. I just restarted all nodes (introduced new fieldTypes and one field) and at least one node has three index directories. Others had two, some just one. Not a single node has an `unable to delete` string in the logs.

          Thanks

          Mark Miller added a comment -

          Could you tell me a little bit about the operations you are performing on your cluster - perhaps I am missing some activity that is needed to trigger the remaining issue(s)?

          Markus Jelsma added a comment -

Besides search and index, only the occasional restart when I change some config or deploy a new build. Sometimes I need to start ZK 3.4 again because it died for some reason. Restarting Tomcat a few times in a row may be a clue here. I'll check again tomorrow whether it's consistent.

          Mark Miller added a comment -

          occasional restart when i change some config

          You can also do that by reloading all the cores in the cluster - the latest collections api work has a collection reload command.

          Restarting Tomcat a few times in a row may be a clue here.

Yeah, if any docs were added while a node was down, it will replicate and perhaps get a new index dir. I'm restarting jetty in my tests literally hundreds of times in a row and have not seen this yet. I also just added the search thread, but so far no luck.

          Markus Jelsma added a comment -

          Hi,

          I'll restart one node with two cores.

          #cat cores/openindex_b/data/index.properties 
          #index properties
          #Wed Jul 25 09:58:26 UTC 2012
          index=index.20120725095644707
          
          # du -h cores/
          4.0K    cores/lib
          46M     cores/openindex_b/data/index.20120725095644707
          404K    cores/openindex_b/data/tlog
          46M     cores/openindex_b/data
          46M     cores/openindex_b
          98M     cores/openindex_a/data/index.20120725095843731
          124K    cores/openindex_a/data/tlog
          98M     cores/openindex_a/data
          98M     cores/openindex_a
          144M    cores/
          

          2012-07-25 10:01:09,176 WARN [solr.core.SolrCore] - [main] - : New index directory detected: old=null new=/opt/solr/cores/openindex_b/data/index.20120725095644707
          ...
          2012-07-25 10:01:17,303 WARN [solr.core.SolrCore] - [main] - : New index directory detected: old=null new=/opt/solr/cores/openindex_a/data/index.20120725095843731
          ...
          2012-07-25 10:01:55,016 WARN [solr.core.SolrCore] - [RecoveryThread] - : New index directory detected: old=/opt/solr/cores/openindex_b/data/index.20120725095644707 new=/opt/solr/cores/openindex_b/data/index.20120725100120496
          ...
          2012-07-25 10:03:35,236 WARN [solr.core.SolrCore] - [RecoveryThread] - : New index directory detected: old=/opt/solr/cores/openindex_a/data/index.20120725100220706 new=/opt/solr/cores/openindex_a/data/index.20120725100321897

          # du -h cores/
          4.0K    cores/lib
          46M     cores/openindex_b/data/index.20120725095644707
          404K    cores/openindex_b/data/tlog
          46M     cores/openindex_b/data/index.20120725100120496
          91M     cores/openindex_b/data
          91M     cores/openindex_b
          98M     cores/openindex_a/data/index.20120725100321897
          98M     cores/openindex_a/data/index.20120725100220706
          124K    cores/openindex_a/data/tlog
          196M    cores/openindex_a/data
          196M    cores/openindex_a
          287M    cores/
          

A few minutes later we still have multiple index directories. No updates have been sent to the cluster during this whole scenario. Each time another directory appears it comes with a lot of I/O; on these RAM-limited machines it's almost thrashing because of the additional directory. It does not create another directory on each restart, but sometimes does; I restarted the same machine again and now I have three dirs for each core.

I'll turn on INFO logging for the node and restart it again without deleting the surplus dirs. The master and slave versions are still the same.

          # du -h cores/
          4.0K    cores/lib
          46M     cores/openindex_b/data/index.20120725100813961
          42M     cores/openindex_b/data/index.20120725101349376
          46M     cores/openindex_b/data/index.20120725095644707
          46M     cores/openindex_b/data/index.20120725101231289
          404K    cores/openindex_b/data/tlog
          46M     cores/openindex_b/data/index.20120725100120496
          223M    cores/openindex_b/data
          223M    cores/openindex_b
          98M     cores/openindex_a/data/index.20120725101252920
          98M     cores/openindex_a/data/index.20120725100220706
          124K    cores/openindex_a/data/tlog
          196M    cores/openindex_a/data
          196M    cores/openindex_a
          418M    cores/
          

          Maybe it cannot find the current index directory on start up (in my case).

          2012-07-25 10:13:36,125 WARN [solr.core.SolrCore] - [main] - : New index directory detected: old=null new=/opt/solr/cores/openindex_b/data/index.20120725101231289
          2012-07-25 10:13:45,840 WARN [solr.core.SolrCore] - [main] - : New index directory detected: old=null new=/opt/solr/cores/openindex_a/data/index.20120725101252920
          2012-07-25 10:15:41,393 WARN [solr.core.SolrCore] - [RecoveryThread] - : New index directory detected: old=/opt/solr/cores/openindex_b/data/index.20120725101231289 new=/opt/solr/cores/openindex_b/data/index.20120725101349376
          2012-07-25 10:15:46,895 WARN [solr.cloud.RecoveryStrategy] - [main-EventThread] - : Stopping recovery for core openindex_b zkNodeName=nl2.index.openindex.io:8080_solr_openindex_b
          2012-07-25 10:15:46,952 WARN [solr.core.SolrCore] - [RecoveryThread] - : [openindex_a] Error opening new searcher. exceeded limit of maxWarmingSearchers=1, try again later.
          2012-07-25 10:15:47,298 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Error while trying to recover.
          org.apache.solr.common.SolrException: Error opening new searcher. exceeded limit of maxWarmingSearchers=1, try again later.
          at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1365)
          at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1157)
          at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:560)
          at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:316)
          at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:210)
          2012-07-25 10:15:47,299 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Recovery failed - trying again...

          This is crazy

          btw: this is today's build.

          Mark Miller added a comment -

Thanks for all the info - I think perhaps this is a different issue - I'll try and look into it.

          Mark Miller added a comment -

          Can I get more of that log? If you are not doing updates, I'm surprised it's going into replication at all on restart...

          Markus Jelsma added a comment -

Ok, I purged the logs, enabled INFO and started a Tomcat. New indexes are created shortly after:
          2012-07-25 10:13:36,125 WARN [solr.core.SolrCore] - [main] - : New index directory detected: old=null new=/opt/solr/cores/openindex_b/data/index.20120725101231289

          I'll send it right now.

          Mark Miller added a comment -

Thanks - there is some sort of race here around opening new searchers and which index dir appears. I'm not sure why you see it and I don't in my tests, but it's fairly clear in your logs.

          I think I can address it with a simple change that ensures we are picking up the right directory to remove. I'll commit that change in a moment.

          Markus Jelsma added a comment -

          It seems this fixes the issue. I'll double check tomorrow!

          Markus Jelsma added a comment -

          Yes, the problem no longer occurs!
          Great work! Thanks

          Mark Miller added a comment -

Thanks, Markus!

          Markus Jelsma added a comment -

          There's still a problem with old index directories not being cleaned up and strange replication on start up. I'll write to the ML for this, the problem is likely larger than just cleaning up.

          zhuojunjian added a comment -

Hi,
Are you sure this issue is fixed in Solr 4.0? We see the issue in Solr 4.0 in our environment.
I have created a new JIRA, SOLR-4506.


            People

            • Assignee:
              Mark Miller
              Reporter:
              Terje Sten Bjerkseth
            • Votes:
              3
              Watchers:
              9
