Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-9050

IndexFetcher not retrying after SocketTimeoutException correctly, which leads to trying a full download again

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 5.3.1
    • Fix Version/s: None
    • Component/s: replication (java)
    • Labels:
      None

      Description

      I'm seeing a problem where reading a large file from the leader (in SolrCloud mode) during index replication leads to a SocketTimeoutException:

      2016-04-28 16:22:23.568 WARN  (RecoveryThread-foo_shard11_replica2) [c:foo s:shard11 r:core_node139 x:foo_shard11_replica2] o.a.s.h.IndexFetcher Error in fetching file: _405k.cfs (downloaded 7314866176 of 9990844536 bytes)
      java.net.SocketTimeoutException: Read timed out
              at java.net.SocketInputStream.socketRead0(Native Method)
              at java.net.SocketInputStream.read(SocketInputStream.java:150)
              at java.net.SocketInputStream.read(SocketInputStream.java:121)
              at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
              at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
              at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
              at org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:253)
              at org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:227)
              at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:186)
              at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
              at org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:80)
              at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
              at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:140)
              at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:167)
              at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:161)
              at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1312)
              at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1275)
              at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:800)
      

      and this leads to the following error in cleanup:

      2016-04-28 16:26:04.332 ERROR (RecoveryThread-foo_shard11_replica2) [c:foo s:shard11 r:core_node139 x:foo_shard11_replica2] o.a.s.h.ReplicationHandler Index fetch failed :org.apache.solr.common.SolrException: Unable to download _405k.cfs completely. Downloaded 7314866176!=9990844536
              at org.apache.solr.handler.IndexFetcher$FileFetcher.cleanup(IndexFetcher.java:1406)
              at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1286)
              at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:800)
              at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:423)
              at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:254)
              at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:380)
              at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:162)
              at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:437)
              at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
      
      2016-04-28 16:26:04.332 ERROR (RecoveryThread-foo_shard11_replica2) [c:foo s:shard11 r:core_node139 x:foo_shard11_replica2] o.a.s.c.RecoveryStrategy Error while trying to recover:org.apache.solr.common.SolrException: Replication for recovery failed.
              at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:165)
              at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:437)
              at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
      

      So a simple read timeout exception leads to re-downloading the whole index again, and again, and again ...

      It also looks like any exception raised in fetchPackets would be squelched if an exception is raised in cleanup (called in the finally block)

        Attachments

        1. SOLR-9050.patch
          3 kB
          Timothy Potter
        2. SOLR-9050.patch
          4 kB
          Timothy Potter

          Activity

            People

            • Assignee:
              thelabdude Timothy Potter
              Reporter:
              thelabdude Timothy Potter
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: