
SOLR-4144: SolrCloud replication high heap consumption

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 5.0
    • Fix Version/s: 4.1, 5.0
    • Labels:
      None
    • Environment:

      5.0-SNAPSHOT 1366361:1416494M - markus - 2012-12-03 14:09:13

      Description

      Recent versions of SolrCloud require a very high heap size compared to older versions. Another cluster running 5.0.0.2012.10.09.19.29.59 (~4GB per core) can restore an empty node without taking a lot of heap (Xmx=256m). Recent versions and current trunk fail miserably even with a higher heap (750m). Both clusters have 10 nodes, 10 shards and 2 cores per node. One note to add is that the cluster on which this fails has only about 1.5GB per core due to changes in the Lucene codec, such as compression.

      After startup everything goes fine...

      2012-12-04 15:05:35,013 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Begin buffering updates. core=shard_c
      2012-12-04 15:05:35,013 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Begin buffering updates. core=shard_b
      2012-12-04 15:05:35,013 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Starting to buffer updates. FSUpdateLog{state=ACTIVE, tlog=null}
      2012-12-04 15:05:35,013 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Starting to buffer updates. FSUpdateLog{state=ACTIVE, tlog=null}
      2012-12-04 15:05:35,013 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Attempting to replicate from http://178.21.118.190:8080/solr/shard_b/. core=shard_b
      2012-12-04 15:05:35,013 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Attempting to replicate from http://178.21.118.192:8080/solr/shard_c/. core=shard_c
      2012-12-04 15:05:35,014 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
      2012-12-04 15:05:35,014 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
      2012-12-04 15:05:35,052 INFO [solr.handler.ReplicationHandler] - [RecoveryThread] - : Commits will be reserved for  10000
      2012-12-04 15:05:35,052 INFO [solr.handler.ReplicationHandler] - [RecoveryThread] - : Commits will be reserved for  10000
      2012-12-04 15:05:35,053 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : Creating new http client, config:connTimeout=5000&socketTimeout=20000&allowCompression=false&maxConnections=10000&maxConnectionsPerHost=10000
      2012-12-04 15:05:35,060 INFO [solr.handler.SnapPuller] - [RecoveryThread] - :  No value set for 'pollInterval'. Timer Task not started.
      2012-12-04 15:05:35,060 INFO [solr.handler.SnapPuller] - [RecoveryThread] - :  No value set for 'pollInterval'. Timer Task not started.
      2012-12-04 15:05:35,069 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Master's generation: 48
      2012-12-04 15:05:35,069 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Slave's generation: 1
      2012-12-04 15:05:35,069 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Starting replication process
      2012-12-04 15:05:35,069 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Master's generation: 47
      2012-12-04 15:05:35,069 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Slave's generation: 1
      2012-12-04 15:05:35,070 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Starting replication process
      2012-12-04 15:05:35,078 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Number of files in latest index in master: 235
      2012-12-04 15:05:35,079 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Number of files in latest index in master: 287
      2012-12-04 15:05:35,084 WARN [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : No lockType configured for NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_c/data/index.20121204150535080 lockFactory=org.apache.lucene.store.NativeFSLockFactory@57530551; maxCacheMB=48.0 maxMergeSizeMB=4.0) assuming 'simple'
      2012-12-04 15:05:35,085 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : return new directory for /opt/solr/cores/shard_c/data/index.20121204150535080 forceNew:false
      2012-12-04 15:05:35,085 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_c/data
      2012-12-04 15:05:35,085 WARN [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : No lockType configured for NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_b/data/index.20121204150535079 lockFactory=org.apache.lucene.store.NativeFSLockFactory@512fb063; maxCacheMB=48.0 maxMergeSizeMB=4.0) assuming 'simple'
      2012-12-04 15:05:35,085 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : return new directory for /opt/solr/cores/shard_b/data/index.20121204150535079 forceNew:false
      2012-12-04 15:05:35,085 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_b/data
      2012-12-04 15:05:35,088 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Starting download to NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_c/data/index.20121204150535080 lockFactory=org.apache.lucene.store.SimpleFSLockFactory@3bd48043; maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true
      2012-12-04 15:05:35,089 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Starting download to NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_b/data/index.20121204150535079 lockFactory=org.apache.lucene.store.SimpleFSLockFactory@67fc9fee; maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true
      

      until suddenly

      2012-12-03 16:14:58,862 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_b/data/index
      2012-12-03 16:15:06,357 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : removing temporary index download directory files NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_b/data/index.20121203161342097 lockFactory=org.apache.lucene.store.SimpleFSLockFactory@424c2849; maxCacheMB=48.0 maxMergeSizeMB=4.0)
      2012-12-03 16:14:58,610 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_c/data/index
      2012-12-03 16:15:06,128 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_c] webapp=/solr path=/admin/system params={wt=json} status=0 QTime=11498 
      2012-12-03 16:15:07,644 ERROR [solr.servlet.SolrDispatchFilter] - [http-8080-exec-5] - : null:java.lang.OutOfMemoryError: Java heap space
      
      2012-12-03 16:15:07,644 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : removing temporary index download directory files NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_c/data/index.20121203161342096 lockFactory=org.apache.lucene.store.SimpleFSLockFactory@7a67f797; maxCacheMB=48.0 maxMergeSizeMB=4.0)
      2012-12-03 16:15:39,655 ERROR [solr.servlet.SolrDispatchFilter] - [http-8080-exec-4] - : null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
      

      Just now it succeeded with Xmx=850m and NewRatio=1. Another test failed with Xmx=750m and NewRatio=1. We can reproduce this behaviour rather easily by purging the data directories and simply starting the node with less heap than it now requires for replication.

      Please also see:
      http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201211.mbox/%3Czarafa.5093d4ee.58d7.528aacd34e1627b6@mail.openindex.io%3E

      Attachments

      1. SOLR-4144.patch (6 kB, Yonik Seeley)

        Activity

        Commit Tag Bot added a comment -

        [branch_4x commit] Mark Robert Miller
        http://svn.apache.org/viewvc?view=revision&revision=1421334

        SOLR-4144: CHANGES entry
        Commit Tag Bot added a comment -

        [trunk commit] Mark Robert Miller
        http://svn.apache.org/viewvc?view=revision&revision=1421331

        SOLR-4144: CHANGES entry
        Mark Miller added a comment -

        It's great to have someone so actively engaging with 5x and reporting issues
        Markus Jelsma added a comment -

        I think this is resolved now, I don't see old index directories and the heap issue is definitely gone! Great work!
        Mark Miller added a comment -

        I committed a fix as part of SOLR-3911
        Mark Miller added a comment -

        Probably due to a file name misspelling Yonik spotted yesterday...
        Markus Jelsma added a comment - edited

        OK, the heap thing seems to be resolved now. I could not replicate the error after deploying the patch, but I could reproduce it reliably before. However, I'm starting to see stale index directories again, similar to a problem that was fixed earlier. Even on a clean cluster with empty data directories we see stale index.* directories some time after we start indexing data.

        Out of 20 cores on the cluster, a few have both an index and an index.* directory. The index directory is empty most of the time. We cannot reliably reproduce this, but at least twice we saw a node replicating the entire index after restart while indexing to the cluster. After replication the old dir is not always cleaned up. At least once we've seen a node cleaning up the mess after we forced it to replicate after restart.
        Commit Tag Bot added a comment -

        [trunk commit] Yonik Seeley
        http://svn.apache.org/viewvc?view=revision&revision=1419980

        SOLR-4144: don't cache replicated index files
        Mark Miller added a comment -

        +1! Patch looks good.
        Markus Jelsma added a comment -

        Yonik, we have a cluster standing by on which I can reproduce the problem. I'll reconfirm this problem and then deploy the wars with your patch tomorrow and report back.
        Yonik Seeley added a comment -

        Here's a patch that should hopefully avoid caching of replicated index files. Unfortunately I don't really have a good way to unit test that these files aren't being cached though...
        Yonik Seeley added a comment -

        I bet this could be due to NRTCachingDirectory? It makes the decision to cache a file or not up-front and can't change when it's part-way through the file.

        If there's no mergeInfo or flushInfo in the context (and the file isn't the segments file), then it will choose to cache the file.
        We need to pass something (like flushInfo) that will convince it not to cache. I'll work up a patch...
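        A minimal Java sketch of the idea described in the comment above, assuming the Lucene 4.x store API (IOContext, FlushInfo, Directory.createOutput): open the download target with a context carrying a size hint large enough that NRTCachingDirectory writes the file straight through to its delegate directory instead of buffering it on heap. The class and method names below are illustrative only and not the committed SOLR-4144 patch.

        // Hedged sketch, not the committed patch: give NRTCachingDirectory a size
        // hint for a file fetched during replication. Its doCacheWrite() check
        // compares the estimated size against maxMergeSizeMB and the cache budget,
        // so a sufficiently large FlushInfo makes it skip the in-heap cache.
        import java.io.IOException;

        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.FlushInfo;
        import org.apache.lucene.store.IOContext;
        import org.apache.lucene.store.IndexOutput;

        public class ReplicatedFileOutputSketch { // hypothetical helper, for illustration

          /**
           * expectedBytes is the file length reported by the master. As long as it
           * exceeds the directory's maxMergeSizeMB threshold (4 MB in the logs
           * above), the output goes to the delegate directory rather than RAM.
           */
          public static IndexOutput openForDownload(Directory dir, String fileName,
                                                    long expectedBytes) throws IOException {
            IOContext ctx = new IOContext(new FlushInfo(0, expectedBytes));
            return dir.createOutput(fileName, ctx);
          }
        }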
        Mark Miller added a comment -

        Thanks for the report! I will try and investigate this soon - I'm on an 11 inch low-powered MacBook Air for the week, so I may not get to it till next week, but we will see...

          People

          • Assignee: Yonik Seeley
          • Reporter: Markus Jelsma
          • Votes: 0
          • Watchers: 5
