Solr
  1. Solr
  2. SOLR-4032

Files larger than an internal buffer size fail to replicate

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 4.1, Trunk
    • Component/s: SolrCloud
    • Labels:
      None
    • Environment:

      5.0-SNAPSHOT 1366361:1404534M - markus - 2012-11-01 12:37:38
      Debian Squeeze, Tomcat 6, Sun Java 6, 10 nodes, 10 shards, rep. factor 2.

      Issue Links

        Activity

        Hide
        Mark Miller added a comment -

        Markus - any chance you can give a slightly longer stack trace on the sizeOf issue that is using the wrong directory?

        The provided trace is teasingly short:

        2012-10-30 16:32:26,578 ERROR [handler.admin.CoreAdminHandler] - [http-8080-exec-2] - : IO error while trying to get the size of the Directory:org.apache.lucene.store.NoSuchDirectoryException: directory '/opt/solr/cores/shard_a/data/index' does not exist
        at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:220)
        at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:243)
        at org.apache.lucene.store.NRTCachingDirectory.listAll(NRTCachingDirectory.java:132)
        at org.apache.solr.core.DirectoryFactory.sizeOfDirectory(DirectoryFactory.java:146)

        Show
        Mark Miller added a comment - Markus - any chance you can give a slightly longer stack trace on the sizeOf issue that is using the wrong directory? The provided trace is teasingly short: 2012-10-30 16:32:26,578 ERROR [handler.admin.CoreAdminHandler] - [http-8080-exec-2] - : IO error while trying to get the size of the Directory:org.apache.lucene.store.NoSuchDirectoryException: directory '/opt/solr/cores/shard_a/data/index' does not exist at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:220) at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:243) at org.apache.lucene.store.NRTCachingDirectory.listAll(NRTCachingDirectory.java:132) at org.apache.solr.core.DirectoryFactory.sizeOfDirectory(DirectoryFactory.java:146)
        Hide
        Mark Miller added a comment -

        Also, how easy is this to reproduce? I added some more robust tests to the replication test at one point to try and tickle this, but had no initial luck.

        Show
        Mark Miller added a comment - Also, how easy is this to reproduce? I added some more robust tests to the replication test at one point to try and tickle this, but had no initial luck.
        Hide
        Mark Miller added a comment - - edited

        Finally, you marked this as affects 4 and fix 4.1 - but we don't know this is more than a trunk issue at the moment do we?

        Show
        Mark Miller added a comment - - edited Finally, you marked this as affects 4 and fix 4.1 - but we don't know this is more than a trunk issue at the moment do we?
        Hide
        Markus Jelsma added a comment -

        I'm not exactly sure on how to precisely reproduce this condition but i've been three times successful in doing so by randomly shutting down some nodes cleanly and make sure at least one node still has the shard while indexing a few million documents from a few reducers in Hadoop.

        Here's one snippet where a node cannot download:

        2012-11-20 14:44:17,365 INFO [solr.handler.ReplicationHandler] - [RecoveryThread] - : Commits will be reserved for  10000
        2012-11-20 14:44:17,365 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : Creating new http client, config:connTimeout=5000&socketTimeout=20000&allowCompression=false&maxConnections=10000&maxConnect
        ionsPerHost=10000
        2012-11-20 14:44:17,367 INFO [solr.core.SolrCore] - [http-8080-exec-4] - : [shard_a] webapp=/solr path=/admin/ping params={} status=503 QTime=0 
        2012-11-20 14:44:17,377 INFO [solr.handler.SnapPuller] - [RecoveryThread] - :  No value set for 'pollInterval'. Timer Task not started.
        2012-11-20 14:44:17,440 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Master's generation: 62
        2012-11-20 14:44:17,440 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Slave's generation: 62
        2012-11-20 14:44:17,440 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Starting replication process
        2012-11-20 14:44:17,449 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Number of files in latest index in master: 287
        2012-11-20 14:44:17,450 WARN [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : No lockType configured for NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_j/data/index.20121120144417450 lockFactory=org.apache.lucene.store.NativeFSLockFactory@5166b0df; maxCacheMB=48.0 maxMergeSizeMB=4.0) assuming 'simple'
        2012-11-20 14:44:17,450 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : return new directory for /opt/solr/cores/shard_j/data/index.20121120144417450 forceNew:false
        2012-11-20 14:44:17,450 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_j/data
        2012-11-20 14:44:17,451 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Starting download to NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_j/data/index.20121120144417450 lockFactory=org.apache.lucene.store.SimpleFSLockFactory@3ade1b36; maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true
        2012-11-20 14:44:18,298 WARN [solr.handler.SnapPuller] - [RecoveryThread] - : Error in fetching packets 
        java.io.EOFException
                at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:151)
                at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:144)
                at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchPackets(SnapPuller.java:1143)
                at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1107)
                at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:716)
                at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387)
                at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273)
                at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152)
                at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
                at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
        2012-11-20 14:44:18,314 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_j/data/index
        2012-11-20 14:44:18,314 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : removing temporary index download directory files NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_j/data/index.20121120144417450 lockFactory=org.apache.lucene.store.SimpleFSLockFactory@3ade1b36; maxCacheMB=48.0 maxMergeSizeMB=4.0)
        2012-11-20 14:44:18,372 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_j/data/index.20121120144417450
        2012-11-20 14:44:18,374 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_j/data/index
        2012-11-20 14:44:18,374 ERROR [solr.handler.ReplicationHandler] - [RecoveryThread] - : SnapPull failed :org.apache.solr.common.SolrException: Unable to download _1t6.fdt completely. Downloaded 9437184!=9501680
                at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapPuller.java:1237)
                at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1118)
                at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:716)
                at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387)
                at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273)
                at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152)
                at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
                at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
        
        2012-11-20 14:44:18,374 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Error while trying to recover:org.apache.solr.common.SolrException: Replication for recovery failed.
                at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:155)
                at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
                at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
        
        2012-11-20 14:44:18,374 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=tlog{file=/opt/solr/cores/shard_j/data/tlog/tlog.0000000000000000059 refcount=1}}
        2012-11-20 14:44:18,374 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Recovery failed - trying again... (0) core=shard_j
        2012-11-20 14:44:18,374 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Wait 2.0 seconds before trying to recover again (1)
        

        And here's a snippet where a node has trouble with it's core instance directory:

        2012-11-20 14:45:54,863 ERROR [solr.core.DirectoryFactory] - [RecoveryThread] - : Error deleting files from Directory:org.apache.lucene.store.NoSuchDirectoryException: directory '/opt/solr/cores/shard_b/data/index.20121120144554468' does not exist
                at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:217)
                at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:240)
                at org.apache.lucene.store.NRTCachingDirectory.listAll(NRTCachingDirectory.java:132)
                at org.apache.solr.core.DirectoryFactory.empty(DirectoryFactory.java:174)
                at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:469)
                at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273)
                at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152)
                at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
                at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
        
        2012-11-20 14:45:54,866 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_b/data/index.20121120144554468
        2012-11-20 14:45:54,866 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_b/data/index.20121120122241377
        2012-11-20 14:45:54,866 ERROR [solr.handler.ReplicationHandler] - [RecoveryThread] - : SnapPull failed :org.apache.solr.common.SolrException: Unable to download _22j_Lucene41_0.tim completely. Downloaded 4194304!=4498323
                at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapPuller.java:1237)
                at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1118)
                at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:716)
                at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387)
                at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273)
                at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152)
                at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
                at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
        
        2012-11-20 14:45:54,866 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Error while trying to recover:org.apache.solr.common.SolrException: Replication for recovery failed.
                at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:155)
                at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
                at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
        
        2012-11-20 14:45:54,866 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=tlog{file=/opt/solr/cores/shard_b/data/tlog/tlog.0000000000000000059 refcount=1}}
        2012-11-20 14:45:54,867 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Recovery failed - trying again... (0) core=shard_b
        2012-11-20 14:45:54,867 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Wait 2.0 seconds before trying to recover again (1)
        2012-11-20 14:45:54,979 WARN [solr.handler.SnapPuller] - [RecoveryThread] - : Error in fetching packets 
        java.io.EOFException
                at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:151)
                at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:144)
                at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchPackets(SnapPuller.java:1143)
                at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1107)
                at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:716)
                at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387)
                at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273)
                at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152)
                at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
                at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
        2012-11-20 14:45:54,985 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_a/data/index
        

        There AIOOBException seems to sometimes accompany when other exceptions occur.

        2012-11-20 14:46:06,963 INFO [solr.handler.ReplicationHandler] - [http-8080-exec-6] - : Commits will be reserved for  10000
        2012-11-20 14:46:06,963 INFO [solr.core.SolrCore] - [http-8080-exec-6] - : [shard_a] webapp=/solr path=/replication params={command=indexversion&wt=javabin&qt=/replication&version=2} status=0 QTime=1 
        2012-11-20 14:46:06,967 INFO [solr.core.CachingDirectoryFactory] - [http-8080-exec-1] - : Releasing directory:/opt/solr/cores/shard_a/data
        2012-11-20 14:46:06,971 INFO [solr.core.CachingDirectoryFactory] - [http-8080-exec-1] - : Releasing directory:/opt/solr/cores/shard_a/data/index
        2012-11-20 14:46:06,971 INFO [solr.core.SolrCore] - [http-8080-exec-1] - : [shard_a] webapp=/solr path=/replication params={command=filelist&generation=63&wt=javabin&qt=/replication&version=2} status=0 QTime=4 
        2012-11-20 14:46:06,982 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_a] webapp=/solr path=/replication params={file=_216_Lucene41_0.doc&command=filecontent&checksum=true&generation=63&qt=/replication&wt=filestream} status=0 QTime=2 
        2012-11-20 14:46:06,989 INFO [solr.core.CachingDirectoryFactory] - [http-8080-exec-5] - : Releasing directory:/opt/solr/cores/shard_a/data/index
        2012-11-20 14:46:06,997 INFO [solr.core.CachingDirectoryFactory] - [http-8080-exec-5] - : Releasing directory:/opt/solr/cores/shard_j/data/index
        2012-11-20 14:46:07,018 INFO [solr.core.SolrCore] - [http-8080-exec-6] - : [shard_a] webapp=/solr path=/admin/system params={wt=json} status=0 QTime=5 
        2012-11-20 14:46:07,031 INFO [solr.core.SolrCore] - [http-8080-exec-4] - : [shard_a] webapp=/solr path=/replication params={file=_216_Lucene41_0.pos&command=filecontent&checksum=true&generation=63&qt=/replication&wt=filestream} status=0 QTime=0 
        2012-11-20 14:46:07,067 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_a] webapp=/solr path=/replication params={file=_22q_Lucene41_0.tip&command=filecontent&checksum=true&generation=63&qt=/replication&wt=filestream} status=0 QTime=0 
        2012-11-20 14:46:07,070 INFO [solr.core.SolrCore] - [http-8080-exec-1] - : [shard_a] webapp=/solr path=/replication params={file=_1tu_Lucene41_0.doc&command=filecontent&checksum=true&generation=63&qt=/replication&wt=filestream} status=0 QTime=0 
        2012-11-20 14:46:07,132 INFO [solr.core.SolrCore] - [http-8080-exec-6] - : [shard_a] webapp=/solr path=/replication params={file=_1t6.fdt&command=filecontent&checksum=true&generation=63&qt=/replication&wt=filestream} status=0 QTime=0 
        2012-11-20 14:46:07,334 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_a] webapp=/solr path=/admin/ping params={} hits=29 status=0 QTime=15 
        2012-11-20 14:46:07,334 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_a] webapp=/solr path=/admin/ping params={} status=0 QTime=15 
        2012-11-20 14:46:07,353 INFO [solr.core.SolrCore] - [http-8080-exec-1] - : [shard_a] webapp=/solr path=/admin/ping params={} hits=29 status=0 QTime=17 
        2012-11-20 14:46:07,353 INFO [solr.core.SolrCore] - [http-8080-exec-1] - : [shard_a] webapp=/solr path=/admin/ping params={} status=0 QTime=17 
        2012-11-20 14:46:07,483 WARN [solr.handler.ReplicationHandler] - [http-8080-exec-6] - : Exception while writing response for params: file=_1t6.fdt&command=filecontent&checksum=true&generation=63&qt=/replication&wt=filestream
        java.io.EOFException: read past EOF: MMapIndexInput(path="/opt/solr/cores/shard_a/data/index/_1t6.fdt")
                at org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:100)
                at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065)
                at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932)
                at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:425)
                at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289)
                at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
                at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
                at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
                at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
                at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
                at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
                at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
                at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
                at org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
                at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
                at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274)
                at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                at java.lang.Thread.run(Thread.java:662)
        2012-11-20 14:46:07,494 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_a] webapp=/solr path=/replication params={file=_1t6.fdt&command=filecontent&checksum=true&offset=4194304&generation=63&qt=/replication&wt=filestream} status=0 QTime=1 
        2012-11-20 14:46:07,495 ERROR [solr.servlet.SolrDispatchFilter] - [http-8080-exec-2] - : null:java.lang.IndexOutOfBoundsException
                at java.nio.Buffer.checkBounds(Buffer.java:530)
                at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:218)
                at org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:91)
                at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065)
                at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932)
                at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:425)
                at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289)
                at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
                at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
                at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
                at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
                at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
                at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
                at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
                at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
                at org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
                at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
                at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274)
                at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                at java.lang.Thread.run(Thread.java:662)
        
        

        I'm positive that the above exceptions do not occur when i index a few million docs without taking down nodes.

        You're right, it's on trunk indeed, not 4.x.

        Show
        Markus Jelsma added a comment - I'm not exactly sure on how to precisely reproduce this condition but i've been three times successful in doing so by randomly shutting down some nodes cleanly and make sure at least one node still has the shard while indexing a few million documents from a few reducers in Hadoop. Here's one snippet where a node cannot download: 2012-11-20 14:44:17,365 INFO [solr.handler.ReplicationHandler] - [RecoveryThread] - : Commits will be reserved for 10000 2012-11-20 14:44:17,365 INFO [solrj.impl.HttpClientUtil] - [RecoveryThread] - : Creating new http client, config:connTimeout=5000&socketTimeout=20000&allowCompression= false &maxConnections=10000&maxConnect ionsPerHost=10000 2012-11-20 14:44:17,367 INFO [solr.core.SolrCore] - [http-8080-exec-4] - : [shard_a] webapp=/solr path=/admin/ping params={} status=503 QTime=0 2012-11-20 14:44:17,377 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : No value set for 'pollInterval'. Timer Task not started. 2012-11-20 14:44:17,440 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Master's generation: 62 2012-11-20 14:44:17,440 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Slave's generation: 62 2012-11-20 14:44:17,440 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Starting replication process 2012-11-20 14:44:17,449 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Number of files in latest index in master: 287 2012-11-20 14:44:17,450 WARN [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : No lockType configured for NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_j/data/index.20121120144417450 lockFactory=org.apache.lucene.store.NativeFSLockFactory@5166b0df; maxCacheMB=48.0 maxMergeSizeMB=4.0) assuming 'simple' 2012-11-20 14:44:17,450 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : return new directory for /opt/solr/cores/shard_j/data/index.20121120144417450 forceNew: false 2012-11-20 14:44:17,450 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_j/data 2012-11-20 14:44:17,451 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : Starting download to NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_j/data/index.20121120144417450 lockFactory=org.apache.lucene.store.SimpleFSLockFactory@3ade1b36; maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy= true 2012-11-20 14:44:18,298 WARN [solr.handler.SnapPuller] - [RecoveryThread] - : Error in fetching packets java.io.EOFException at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:151) at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:144) at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchPackets(SnapPuller.java:1143) at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1107) at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:716) at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387) at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273) at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152) at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407) at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222) 2012-11-20 14:44:18,314 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_j/data/index 2012-11-20 14:44:18,314 INFO [solr.handler.SnapPuller] - [RecoveryThread] - : removing temporary index download directory files NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_j/data/index.20121120144417450 lockFactory=org.apache.lucene.store.SimpleFSLockFactory@3ade1b36; maxCacheMB=48.0 maxMergeSizeMB=4.0) 2012-11-20 14:44:18,372 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_j/data/index.20121120144417450 2012-11-20 14:44:18,374 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_j/data/index 2012-11-20 14:44:18,374 ERROR [solr.handler.ReplicationHandler] - [RecoveryThread] - : SnapPull failed :org.apache.solr.common.SolrException: Unable to download _1t6.fdt completely. Downloaded 9437184!=9501680 at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapPuller.java:1237) at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1118) at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:716) at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387) at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273) at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152) at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407) at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222) 2012-11-20 14:44:18,374 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Error while trying to recover:org.apache.solr.common.SolrException: Replication for recovery failed. at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:155) at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407) at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222) 2012-11-20 14:44:18,374 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=tlog{file=/opt/solr/cores/shard_j/data/tlog/tlog.0000000000000000059 refcount=1}} 2012-11-20 14:44:18,374 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Recovery failed - trying again... (0) core=shard_j 2012-11-20 14:44:18,374 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Wait 2.0 seconds before trying to recover again (1) And here's a snippet where a node has trouble with it's core instance directory: 2012-11-20 14:45:54,863 ERROR [solr.core.DirectoryFactory] - [RecoveryThread] - : Error deleting files from Directory:org.apache.lucene.store.NoSuchDirectoryException: directory '/opt/solr/cores/shard_b/data/index.20121120144554468' does not exist at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:217) at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:240) at org.apache.lucene.store.NRTCachingDirectory.listAll(NRTCachingDirectory.java:132) at org.apache.solr.core.DirectoryFactory.empty(DirectoryFactory.java:174) at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:469) at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273) at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152) at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407) at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222) 2012-11-20 14:45:54,866 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_b/data/index.20121120144554468 2012-11-20 14:45:54,866 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_b/data/index.20121120122241377 2012-11-20 14:45:54,866 ERROR [solr.handler.ReplicationHandler] - [RecoveryThread] - : SnapPull failed :org.apache.solr.common.SolrException: Unable to download _22j_Lucene41_0.tim completely. Downloaded 4194304!=4498323 at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapPuller.java:1237) at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1118) at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:716) at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387) at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273) at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152) at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407) at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222) 2012-11-20 14:45:54,866 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Error while trying to recover:org.apache.solr.common.SolrException: Replication for recovery failed. at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:155) at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407) at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222) 2012-11-20 14:45:54,866 INFO [solr.update.UpdateLog] - [RecoveryThread] - : Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=tlog{file=/opt/solr/cores/shard_b/data/tlog/tlog.0000000000000000059 refcount=1}} 2012-11-20 14:45:54,867 ERROR [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Recovery failed - trying again... (0) core=shard_b 2012-11-20 14:45:54,867 INFO [solr.cloud.RecoveryStrategy] - [RecoveryThread] - : Wait 2.0 seconds before trying to recover again (1) 2012-11-20 14:45:54,979 WARN [solr.handler.SnapPuller] - [RecoveryThread] - : Error in fetching packets java.io.EOFException at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:151) at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:144) at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchPackets(SnapPuller.java:1143) at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1107) at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:716) at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:387) at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273) at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:152) at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407) at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222) 2012-11-20 14:45:54,985 INFO [solr.core.CachingDirectoryFactory] - [RecoveryThread] - : Releasing directory:/opt/solr/cores/shard_a/data/index There AIOOBException seems to sometimes accompany when other exceptions occur. 2012-11-20 14:46:06,963 INFO [solr.handler.ReplicationHandler] - [http-8080-exec-6] - : Commits will be reserved for 10000 2012-11-20 14:46:06,963 INFO [solr.core.SolrCore] - [http-8080-exec-6] - : [shard_a] webapp=/solr path=/replication params={command=indexversion&wt=javabin&qt=/replication&version=2} status=0 QTime=1 2012-11-20 14:46:06,967 INFO [solr.core.CachingDirectoryFactory] - [http-8080-exec-1] - : Releasing directory:/opt/solr/cores/shard_a/data 2012-11-20 14:46:06,971 INFO [solr.core.CachingDirectoryFactory] - [http-8080-exec-1] - : Releasing directory:/opt/solr/cores/shard_a/data/index 2012-11-20 14:46:06,971 INFO [solr.core.SolrCore] - [http-8080-exec-1] - : [shard_a] webapp=/solr path=/replication params={command=filelist&generation=63&wt=javabin&qt=/replication&version=2} status=0 QTime=4 2012-11-20 14:46:06,982 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_a] webapp=/solr path=/replication params={file=_216_Lucene41_0.doc&command=filecontent&checksum= true &generation=63&qt=/replication&wt=filestream} status=0 QTime=2 2012-11-20 14:46:06,989 INFO [solr.core.CachingDirectoryFactory] - [http-8080-exec-5] - : Releasing directory:/opt/solr/cores/shard_a/data/index 2012-11-20 14:46:06,997 INFO [solr.core.CachingDirectoryFactory] - [http-8080-exec-5] - : Releasing directory:/opt/solr/cores/shard_j/data/index 2012-11-20 14:46:07,018 INFO [solr.core.SolrCore] - [http-8080-exec-6] - : [shard_a] webapp=/solr path=/admin/system params={wt=json} status=0 QTime=5 2012-11-20 14:46:07,031 INFO [solr.core.SolrCore] - [http-8080-exec-4] - : [shard_a] webapp=/solr path=/replication params={file=_216_Lucene41_0.pos&command=filecontent&checksum= true &generation=63&qt=/replication&wt=filestream} status=0 QTime=0 2012-11-20 14:46:07,067 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_a] webapp=/solr path=/replication params={file=_22q_Lucene41_0.tip&command=filecontent&checksum= true &generation=63&qt=/replication&wt=filestream} status=0 QTime=0 2012-11-20 14:46:07,070 INFO [solr.core.SolrCore] - [http-8080-exec-1] - : [shard_a] webapp=/solr path=/replication params={file=_1tu_Lucene41_0.doc&command=filecontent&checksum= true &generation=63&qt=/replication&wt=filestream} status=0 QTime=0 2012-11-20 14:46:07,132 INFO [solr.core.SolrCore] - [http-8080-exec-6] - : [shard_a] webapp=/solr path=/replication params={file=_1t6.fdt&command=filecontent&checksum= true &generation=63&qt=/replication&wt=filestream} status=0 QTime=0 2012-11-20 14:46:07,334 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_a] webapp=/solr path=/admin/ping params={} hits=29 status=0 QTime=15 2012-11-20 14:46:07,334 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_a] webapp=/solr path=/admin/ping params={} status=0 QTime=15 2012-11-20 14:46:07,353 INFO [solr.core.SolrCore] - [http-8080-exec-1] - : [shard_a] webapp=/solr path=/admin/ping params={} hits=29 status=0 QTime=17 2012-11-20 14:46:07,353 INFO [solr.core.SolrCore] - [http-8080-exec-1] - : [shard_a] webapp=/solr path=/admin/ping params={} status=0 QTime=17 2012-11-20 14:46:07,483 WARN [solr.handler.ReplicationHandler] - [http-8080-exec-6] - : Exception while writing response for params: file=_1t6.fdt&command=filecontent&checksum= true &generation=63&qt=/replication&wt=filestream java.io.EOFException: read past EOF: MMapIndexInput(path= "/opt/solr/cores/shard_a/data/index/_1t6.fdt" ) at org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:100) at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065) at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932) at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:425) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293) at org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889) at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744) at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang. Thread .run( Thread .java:662) 2012-11-20 14:46:07,494 INFO [solr.core.SolrCore] - [http-8080-exec-2] - : [shard_a] webapp=/solr path=/replication params={file=_1t6.fdt&command=filecontent&checksum= true &offset=4194304&generation=63&qt=/replication&wt=filestream} status=0 QTime=1 2012-11-20 14:46:07,495 ERROR [solr.servlet.SolrDispatchFilter] - [http-8080-exec-2] - : null :java.lang.IndexOutOfBoundsException at java.nio.Buffer.checkBounds(Buffer.java:530) at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:218) at org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:91) at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065) at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932) at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:425) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293) at org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889) at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744) at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2274) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang. Thread .run( Thread .java:662) I'm positive that the above exceptions do not occur when i index a few million docs without taking down nodes. You're right, it's on trunk indeed, not 4.x.
        Hide
        Mark Miller added a comment -

        Okay, I can reproduce this in a test now - the test just has to index many docs - I tried 1 million just to be sure and I see issues - I think I was trying in the 10's of thousands before.

        Show
        Mark Miller added a comment - Okay, I can reproduce this in a test now - the test just has to index many docs - I tried 1 million just to be sure and I see issues - I think I was trying in the 10's of thousands before.
        Hide
        Eks Dev added a comment -

        We see it as well,

        it looks like it only happens with NRTCachingDirectory, but take this statement with healthy suspicion. It went ok only once without NRTCachingDirectory.

        Show
        Eks Dev added a comment - We see it as well, it looks like it only happens with NRTCachingDirectory, but take this statement with healthy suspicion. It went ok only once without NRTCachingDirectory.
        Hide
        Mark Miller added a comment -

        I'll try and make a fix soon.

        Show
        Mark Miller added a comment - I'll try and make a fix soon.
        Hide
        Markus Jelsma added a comment -

        Might be related to SOLR-4033 regarding NRTDir.

        Show
        Markus Jelsma added a comment - Might be related to SOLR-4033 regarding NRTDir.
        Hide
        Mark Miller added a comment -

        It's just silly bugs because the tests never replicated a file larger than the buffer size. I've added a @Nightly test that does.

        Show
        Mark Miller added a comment - It's just silly bugs because the tests never replicated a file larger than the buffer size. I've added a @Nightly test that does.
        Hide
        Commit Tag Bot added a comment -

        [trunk commit] Mark Robert Miller
        http://svn.apache.org/viewvc?view=revision&revision=1414428

        SOLR-4032: Files larger than an internal buffer size fail to replicate.

        Show
        Commit Tag Bot added a comment - [trunk commit] Mark Robert Miller http://svn.apache.org/viewvc?view=revision&revision=1414428 SOLR-4032 : Files larger than an internal buffer size fail to replicate.
        Hide
        Mark Miller added a comment -

        Thanks Markus!

        Show
        Mark Miller added a comment - Thanks Markus!
        Hide
        Markus Jelsma added a comment - - edited

        Great work, it seems this issue is resolved indeed as i cannot reproduce this exact problem. But another EOF exception pops up, i'll open a new issue.
        edit: another issue popped up as well, added SOLR-4116 and SOLR-4117

        Show
        Markus Jelsma added a comment - - edited Great work, it seems this issue is resolved indeed as i cannot reproduce this exact problem. But another EOF exception pops up, i'll open a new issue. edit: another issue popped up as well, added SOLR-4116 and SOLR-4117
        Hide
        Shalin Shekhar Mangar added a comment -

        The fix was released in 4.1

        Show
        Shalin Shekhar Mangar added a comment - The fix was released in 4.1

          People

          • Assignee:
            Mark Miller
            Reporter:
            Markus Jelsma
          • Votes:
            2 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development