Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-4038

On OOM, SolrCloud indexing blocks if shard is marked as DOWN

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Won't Fix
    • 4.0
    • 4.2, 6.0
    • SolrCloud
    • None

    Description

      See: http://lucene.472066.n3.nabble.com/SolrCloud-indexing-blocks-if-node-is-recovering-td4017827.html

      While indexing (without CloudSolrServer at that time) one node dies with an OOME perhaps because of the linked issue SOLR-4032. The OOME stack traces are varied but here are some ZK-related logs between the OOME stack traces:

      2012-11-02 14:14:37,126 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=null}
      2012-11-02 14:14:37,127 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Recovery failed - trying again... (2) core=shard_e
      2012-11-02 14:14:37,127 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Wait 8.0 seconds before trying to recover again (3)
      2012-11-02 14:14:45,328 INFO [solr.cloud.ZkController] - [RecoveryThread] - : numShards not found on descriptor - reading it from system property
      2012-11-02 14:14:45,363 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Starting Replication Recovery. core=shard_e
      2012-11-02 14:14:45,363 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
      2012-11-02 14:14:45,775 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster state change has occurred - updating... (10)
      2012-11-02 14:14:50,987 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Begin buffering updates. core=shard_e
      2012-11-02 14:14:50,987 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Starting to buffer updates. FSUpdateLog{state=ACTIVE, tlog=null}
      2012-11-02 14:14:50,987 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Attempting to replicate from http://rot05.solrserver:8080/solr/shard_e/. core=shard_e
      2012-11-02 14:14:50,987 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
      2012-11-02 14:15:03,303 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_f/data/index
      2012-11-02 14:15:03,303 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : removing temporary index download directory files NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_f/data/index.20121102141424591 lockFactory=org.apache.lucene.store.SimpleFSLockFactory@1520a48c; maxCacheMB=48.0 maxMergeSizeMB=4.0)
      2012-11-02 14:15:09,421 INFO [apache.zookeeper.ClientCnxn] - [main-SendThread(rot1.zkserver:2181)] - : Client session timed out, have not heard from server in 11873ms for sessionid 0x13abc504486000f, closing socket connection and attempting reconnect
      2012-11-02 14:15:09,422 ERROR [solr.core.SolrCore] - [http-8080-exec-1] - : org.apache.solr.common.SolrException: Ping query caused exception: Java heap space
      .....
      .....
      2012-11-02 14:15:09,867 INFO [common.cloud.ConnectionManager] - [main-EventThread] - : Watcher org.apache.solr.common.cloud.ConnectionManager@305e7020 name:ZooKeeperConnection Watcher:rot1.zkserver:2181,rot2.zkserver:2181 got event WatchedEvent state:Disconnected type:None path:null path:null type:None
      2012-11-02 14:15:09,867 INFO [common.cloud.ConnectionManager] - [main-EventThread] - : zkClient has disconnected
      2012-11-02 14:15:09,869 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Error while trying to recover:java.lang.OutOfMemoryError: Java heap space
      .....
      .....
      2012-11-02 14:15:10,159 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=null}
      2012-11-02 14:15:10,159 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Recovery failed - trying again... (3) core=shard_e
      2012-11-02 14:15:10,159 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Wait 16.0 seconds before trying to recover again (4)
      2012-11-02 14:15:09,878 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_f/data/index.20121102141424591
      2012-11-02 14:15:10,192 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_f_f/data/index
      2012-11-02 14:15:10,192 ERROR [solr.handler.ReplicationHandler] - [RecoveryThread] - : SnapPull failed :org.apache.solr.common.SolrException: Unable to download _773.tvf completely. Downloaded 246
      415360!=562327645
      .....
      .....
      

      At this point indexing has already been blocked. Some nodes do not write anything to the logs and the two surrounding nodes are still busy doing some replication. Most nodes show a increased number of state changes:

      2012-11-02 14:16:47,768 INFO [common.cloud.ZkStateReader] - [main-EventThread] - : A cluster state change has occurred - updating... (10)
      

      It's about 10 minutes of doing little by trying replication by the surrounding nodes and mostly state changes for the other nodes before are documents indexed again. At that point i stopped the node.

      Attachments

        Issue Links

          Activity

            People

              markrmiller@gmail.com Mark Miller
              markus17 Markus Jelsma
              Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: