[SOLR-4038] On OOM, SolrCloud indexing blocks if shard is marked as DOWN - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 4.0
Fix Version/s: 4.2, 6.0
Component/s: SolrCloud
Labels:
None

Description

See: http://lucene.472066.n3.nabble.com/SolrCloud-indexing-blocks-if-node-is-recovering-td4017827.html

While indexing (without CloudSolrServer at that time) one node dies with an OOME perhaps because of the linked issue ~~SOLR-4032~~. The OOME stack traces are varied but here are some ZK-related logs between the OOME stack traces:

2012-11-02 14:14:37,126 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=null}
2012-11-02 14:14:37,127 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Recovery failed - trying again... (2) core=shard_e
2012-11-02 14:14:37,127 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Wait 8.0 seconds before trying to recover again (3)
2012-11-02 14:14:45,328 INFO [solr.cloud.ZkController] - [RecoveryThread] - : numShards not found on descriptor - reading it from system property
2012-11-02 14:14:45,363 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Starting Replication Recovery. core=shard_e
2012-11-02 14:14:45,363 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
2012-11-02 14:14:45,775 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster state change has occurred - updating... (10)
2012-11-02 14:14:50,987 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Begin buffering updates. core=shard_e
2012-11-02 14:14:50,987 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Starting to buffer updates. FSUpdateLog{state=ACTIVE, tlog=null}
2012-11-02 14:14:50,987 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Attempting to replicate from http://rot05.solrserver:8080/solr/shard_e/. core=shard_e
2012-11-02 14:14:50,987 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
2012-11-02 14:15:03,303 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_f/data/index
2012-11-02 14:15:03,303 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : removing temporary index download directory files NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_f/data/index.20121102141424591 lockFactory=org.apache.lucene.store.SimpleFSLockFactory@1520a48c; maxCacheMB=48.0 maxMergeSizeMB=4.0)
2012-11-02 14:15:09,421 INFO [apache.zookeeper.ClientCnxn] - [main-SendThread(rot1.zkserver:2181)] - : Client session timed out, have not heard from server in 11873ms for sessionid 0x13abc504486000f, closing socket connection and attempting reconnect
2012-11-02 14:15:09,422 ERROR [solr.core.SolrCore] - [http-8080-exec-1] - : org.apache.solr.common.SolrException: Ping query caused exception: Java heap space
.....
.....
2012-11-02 14:15:09,867 INFO [common.cloud.ConnectionManager] - [main-EventThread] - : Watcher org.apache.solr.common.cloud.ConnectionManager@305e7020 name:ZooKeeperConnection Watcher:rot1.zkserver:2181,rot2.zkserver:2181 got event WatchedEvent state:Disconnected type:None path:null path:null type:None
2012-11-02 14:15:09,867 INFO [common.cloud.ConnectionManager] - [main-EventThread] - : zkClient has disconnected
2012-11-02 14:15:09,869 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Error while trying to recover:java.lang.OutOfMemoryError: Java heap space
.....
.....
2012-11-02 14:15:10,159 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=null}
2012-11-02 14:15:10,159 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Recovery failed - trying again... (3) core=shard_e
2012-11-02 14:15:10,159 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Wait 16.0 seconds before trying to recover again (4)
2012-11-02 14:15:09,878 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_f/data/index.20121102141424591
2012-11-02 14:15:10,192 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_f_f/data/index
2012-11-02 14:15:10,192 ERROR [solr.handler.ReplicationHandler] - [RecoveryThread] - : SnapPull failed :org.apache.solr.common.SolrException: Unable to download _773.tvf completely. Downloaded 246
415360!=562327645
.....
.....

At this point indexing has already been blocked. Some nodes do not write anything to the logs and the two surrounding nodes are still busy doing some replication. Most nodes show a increased number of state changes:

2012-11-02 14:16:47,768 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster state change has occurred - updating... (10)

It's about 10 minutes of doing little by trying replication by the surrounding nodes and mostly state changes for the other nodes before are documents indexed again. At that point i stopped the node.

Attachments

Issue Links

relates to

SOLR-4032 Files larger than an internal buffer size fail to replicate

Closed

On OOM, SolrCloud indexing blocks if shard is marked as DOWN

Details

Description

Attachments

Issue Links

Activity

People

Dates