Solr
  1. Solr
  2. SOLR-4116

Log Replay [recoveryExecutor-8-thread-1] - : java.io.EOFException

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 5.1
    • Fix Version/s: 5.5, 6.0
    • Component/s: SolrCloud
    • Labels:
      None
    • Environment:

      5.0.0.2012.11.28.10.42.06
      Debian Squeeze, Tomcat 6, Sun Java 6, 10 nodes, 10 shards, rep. factor 2.

      Description

      With SOLR-4032 fixed we see other issues when randomly taking down nodes (nicely via tomcat restart) while indexing a few million web pages from Hadoop. We do make sure that at least one node is up for a shard but due to recovery issues it may not be live.

      2012-11-28 11:32:33,086 WARN [solr.update.UpdateLog] - [recoveryExecutor-8-thread-1] - : Starting log replay tlog{file=/opt/solr/cores/openindex_e/data/tlog/tlog.0000000000000000028 refcount=2} active=false starting pos=0
      2012-11-28 11:32:41,873 ERROR [solr.update.UpdateLog] - [recoveryExecutor-8-thread-1] - : java.io.EOFException
              at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:151)
              at org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:479)
              at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:176)
              at org.apache.solr.common.util.JavaBinCodec.readSolrInputDocument(JavaBinCodec.java:374)
              at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:225)
              at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:451)
              at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:182)
              at org.apache.solr.update.TransactionLog$LogReader.next(TransactionLog.java:618)
              at org.apache.solr.update.UpdateLog$LogReplayer.doReplay(UpdateLog.java:1198)
              at org.apache.solr.update.UpdateLog$LogReplayer.run(UpdateLog.java:1143)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
              at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
              at java.util.concurrent.FutureTask.run(FutureTask.java:138)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
              at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
              at java.util.concurrent.FutureTask.run(FutureTask.java:138)
              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
              at java.lang.Thread.run(Thread.java:662)
      

        Issue Links

          Activity

          Hide
          Yonik Seeley added a comment -

          I don't know what "tomcat restart" does, but perhaps it's not as nice as you think if it causes a log replay on restart? Anyway, bringing down a server roughly enough (like kill -9) can cause truncated tlog files.
          But truncated log files are expected and should not cause fatal exceptions (and we have tests for that). This exception causes the core not to come up?

          Show
          Yonik Seeley added a comment - I don't know what "tomcat restart" does, but perhaps it's not as nice as you think if it causes a log replay on restart? Anyway, bringing down a server roughly enough (like kill -9) can cause truncated tlog files. But truncated log files are expected and should not cause fatal exceptions (and we have tests for that). This exception causes the core not to come up?
          Hide
          Markus Jelsma added a comment -

          Restarting or stopping Tomcat shuts down CoreContainer and stops recovery, i believe this is nice enough or isn't it? This error does not cause the core not to come up.

          2012-11-28 14:10:15,227 INFO [solr.core.CoreContainer] - [Thread-6] - : Shutting down CoreContainer instance=1830423861
          2012-11-28 14:10:15,227 WARN [solr.cloud.RecoveryStrategy] - [Thread-6] - : Stopping recovery for zkNodeName=178.21.118.195:8080_solr_shard_fcore=shard_f
          2012-11-28 14:10:15,227 WARN [solr.cloud.RecoveryStrategy] - [Thread-6] - : Stopping recovery for zkNodeName=178.21.118.195:8080_solr_shard_gcore=shard_g
          2012-11-28 14:10:15,227 INFO [solr.core.SolrCore] - [Thread-6] - : [shard_f]  CLOSING SolrCore org.apache.solr.core.SolrCore@513c952f
          2012-11-28 14:10:15,230 INFO [solr.update.UpdateHandler] - [Thread-6] - : closing DirectUpdateHandler2{commits=1,autocommit maxTime=120000ms,autocommits=0,soft autocommit maxTime=10000ms,soft autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0}
          2012-11-28 14:10:15,231 INFO [solr.core.SolrCore] - [Thread-6] - : Closing SolrCoreState
          2012-11-28 14:10:15,231 INFO [solr.update.DefaultSolrCoreState] - [Thread-6] - : SolrCoreState ref count has reached 0 - closing IndexWriter
          2012-11-28 14:10:15,231 INFO [solr.update.DefaultSolrCoreState] - [Thread-6] - : closing IndexWriter with IndexWriterCloser
          2012-11-28 14:10:15,234 INFO [solr.core.CachingDirectoryFactory] - [Thread-6] - : Releasing directory:/opt/solr/cores/shard_f/data/index.20121128113300496
          2012-11-28 14:10:15,235 INFO [solr.core.SolrCore] - [Thread-6] - : [shard_f] Closing main searcher on request.
          2012-11-28 14:10:15,244 INFO [solr.core.CachingDirectoryFactory] - [Thread-6] - : Releasing directory:/opt/solr/cores/shard_f/data/index.20121128113300496
          2012-11-28 14:10:15,244 INFO [solr.core.SolrCore] - [Thread-6] - : [shard_g]  CLOSING SolrCore org.apache.solr.core.SolrCore@24be0446
          2012-11-28 14:10:15,248 INFO [solr.update.UpdateHandler] - [Thread-6] - : closing DirectUpdateHandler2{commits=1,autocommit maxTime=120000ms,autocommits=0,soft autocommit maxTime=10000ms,soft autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0}
          2012-11-28 14:10:15,248 INFO [solr.core.SolrCore] - [Thread-6] - : Closing SolrCoreState
          2012-11-28 14:10:15,248 INFO [solr.update.DefaultSolrCoreState] - [Thread-6] - : SolrCoreState ref count has reached 0 - closing IndexWriter
          2012-11-28 14:10:15,248 INFO [solr.update.DefaultSolrCoreState] - [Thread-6] - : closing IndexWriter with IndexWriterCloser
          2012-11-28 14:10:15,250 INFO [solr.core.CachingDirectoryFactory] - [Thread-6] - : Releasing directory:/opt/solr/cores/shard_g/data/index.20121128113035951
          2012-11-28 14:10:15,250 INFO [solr.core.SolrCore] - [Thread-6] - : [shard_g] Closing main searcher on request.
          2012-11-28 14:10:15,256 INFO [solr.core.CachingDirectoryFactory] - [Thread-6] - : Releasing directory:/opt/solr/cores/shard_g/data/index.20121128113035951
          2012-11-28 14:10:15,281 INFO [apache.zookeeper.ZooKeeper] - [Thread-6] - : Session: 0x13b4668803e000f closed
          
          Show
          Markus Jelsma added a comment - Restarting or stopping Tomcat shuts down CoreContainer and stops recovery, i believe this is nice enough or isn't it? This error does not cause the core not to come up. 2012-11-28 14:10:15,227 INFO [solr.core.CoreContainer] - [ Thread -6] - : Shutting down CoreContainer instance=1830423861 2012-11-28 14:10:15,227 WARN [solr.cloud.RecoveryStrategy] - [ Thread -6] - : Stopping recovery for zkNodeName=178.21.118.195:8080_solr_shard_fcore=shard_f 2012-11-28 14:10:15,227 WARN [solr.cloud.RecoveryStrategy] - [ Thread -6] - : Stopping recovery for zkNodeName=178.21.118.195:8080_solr_shard_gcore=shard_g 2012-11-28 14:10:15,227 INFO [solr.core.SolrCore] - [ Thread -6] - : [shard_f] CLOSING SolrCore org.apache.solr.core.SolrCore@513c952f 2012-11-28 14:10:15,230 INFO [solr.update.UpdateHandler] - [ Thread -6] - : closing DirectUpdateHandler2{commits=1,autocommit maxTime=120000ms,autocommits=0,soft autocommit maxTime=10000ms,soft autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0} 2012-11-28 14:10:15,231 INFO [solr.core.SolrCore] - [ Thread -6] - : Closing SolrCoreState 2012-11-28 14:10:15,231 INFO [solr.update.DefaultSolrCoreState] - [ Thread -6] - : SolrCoreState ref count has reached 0 - closing IndexWriter 2012-11-28 14:10:15,231 INFO [solr.update.DefaultSolrCoreState] - [ Thread -6] - : closing IndexWriter with IndexWriterCloser 2012-11-28 14:10:15,234 INFO [solr.core.CachingDirectoryFactory] - [ Thread -6] - : Releasing directory:/opt/solr/cores/shard_f/data/index.20121128113300496 2012-11-28 14:10:15,235 INFO [solr.core.SolrCore] - [ Thread -6] - : [shard_f] Closing main searcher on request. 2012-11-28 14:10:15,244 INFO [solr.core.CachingDirectoryFactory] - [ Thread -6] - : Releasing directory:/opt/solr/cores/shard_f/data/index.20121128113300496 2012-11-28 14:10:15,244 INFO [solr.core.SolrCore] - [ Thread -6] - : [shard_g] CLOSING SolrCore org.apache.solr.core.SolrCore@24be0446 2012-11-28 14:10:15,248 INFO [solr.update.UpdateHandler] - [ Thread -6] - : closing DirectUpdateHandler2{commits=1,autocommit maxTime=120000ms,autocommits=0,soft autocommit maxTime=10000ms,soft autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0} 2012-11-28 14:10:15,248 INFO [solr.core.SolrCore] - [ Thread -6] - : Closing SolrCoreState 2012-11-28 14:10:15,248 INFO [solr.update.DefaultSolrCoreState] - [ Thread -6] - : SolrCoreState ref count has reached 0 - closing IndexWriter 2012-11-28 14:10:15,248 INFO [solr.update.DefaultSolrCoreState] - [ Thread -6] - : closing IndexWriter with IndexWriterCloser 2012-11-28 14:10:15,250 INFO [solr.core.CachingDirectoryFactory] - [ Thread -6] - : Releasing directory:/opt/solr/cores/shard_g/data/index.20121128113035951 2012-11-28 14:10:15,250 INFO [solr.core.SolrCore] - [ Thread -6] - : [shard_g] Closing main searcher on request. 2012-11-28 14:10:15,256 INFO [solr.core.CachingDirectoryFactory] - [ Thread -6] - : Releasing directory:/opt/solr/cores/shard_g/data/index.20121128113035951 2012-11-28 14:10:15,281 INFO [apache.zookeeper.ZooKeeper] - [ Thread -6] - : Session: 0x13b4668803e000f closed
          Hide
          Markus Jelsma added a comment - - edited

          Awesome, i got a strange exception, searched the web for it and ended up here, again! This time it is Solr 5.1, i completely removed the data directory, restarted and indexed 18 million records. Then restarted Solr again, and got this slightly different stack trace:

          10400 [recoveryExecutor-19-thread-1] ERROR org.apache.solr.update.UpdateLog  [   ] – java.io.EOFException
                  at org.apache.solr.common.util.FastInputStream.readUnsignedByte(FastInputStream.java:73)
                  at org.apache.solr.common.util.FastInputStream.readLong(FastInputStream.java:240)
                  at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:211)
                  at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:492)
                  at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
                  at org.apache.solr.update.TransactionLog$LogReader.next(TransactionLog.java:642)
                  at org.apache.solr.update.UpdateLog$LogReplayer.doReplay(UpdateLog.java:1305)
                  at org.apache.solr.update.UpdateLog$LogReplayer.run(UpdateLog.java:1233)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                  at java.lang.Thread.run(Thread.java:745)
          

          There is no further real malfunction so this error could be ignored, but shouldnt occur anyway.

          Show
          Markus Jelsma added a comment - - edited Awesome, i got a strange exception, searched the web for it and ended up here, again! This time it is Solr 5.1, i completely removed the data directory, restarted and indexed 18 million records. Then restarted Solr again, and got this slightly different stack trace: 10400 [recoveryExecutor-19-thread-1] ERROR org.apache.solr.update.UpdateLog [ ] – java.io.EOFException at org.apache.solr.common.util.FastInputStream.readUnsignedByte(FastInputStream.java:73) at org.apache.solr.common.util.FastInputStream.readLong(FastInputStream.java:240) at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:211) at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:492) at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186) at org.apache.solr.update.TransactionLog$LogReader.next(TransactionLog.java:642) at org.apache.solr.update.UpdateLog$LogReplayer.doReplay(UpdateLog.java:1305) at org.apache.solr.update.UpdateLog$LogReplayer.run(UpdateLog.java:1233) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang. Thread .run( Thread .java:745) There is no further real malfunction so this error could be ignored, but shouldnt occur anyway.
          Hide
          Mark Miller added a comment -

          Perhaps related to https://issues.apache.org/jira/browse/SOLR-7478

          Seems like the tlog was likely not closed properly if reading it shows corruption.

          Show
          Mark Miller added a comment - Perhaps related to https://issues.apache.org/jira/browse/SOLR-7478 Seems like the tlog was likely not closed properly if reading it shows corruption.
          Hide
          Shalin Shekhar Mangar added a comment -

          Bulk move to 5.4 after 5.3 release.

          Show
          Shalin Shekhar Mangar added a comment - Bulk move to 5.4 after 5.3 release.
          Hide
          Mike Drob added a comment -

          I saw similar stack traces recently with the same EOFException when running on HDFS. Looking at the javadoc on LogReader.next it suggests that we should return null in case of EOF, or is that only when we hit a "nice" EOF at a record boundary?

          Alternatively, if we're going to continue throwing the EOF to signal that a file is probably corrupt (or at least truncated), we should include the file name in the message.

          Show
          Mike Drob added a comment - I saw similar stack traces recently with the same EOFException when running on HDFS. Looking at the javadoc on LogReader.next it suggests that we should return null in case of EOF, or is that only when we hit a "nice" EOF at a record boundary? Alternatively, if we're going to continue throwing the EOF to signal that a file is probably corrupt (or at least truncated), we should include the file name in the message.
          Hide
          Mark Miller added a comment -

          Looks like with tomcat you have to crank the unloadDelay for the webapp context to be sure there is enough time for the final commit and the shutdown stuff that could take a while.

          Show
          Mark Miller added a comment - Looks like with tomcat you have to crank the unloadDelay for the webapp context to be sure there is enough time for the final commit and the shutdown stuff that could take a while.
          Hide
          Mark Miller added a comment -

          Hmm...that for how long a request can finish, but this stuff should be how long it allows for filter#destroy I think. Jetty has a setting for this if I recall correctly, but not finding tomcats version.

          Show
          Mark Miller added a comment - Hmm...that for how long a request can finish, but this stuff should be how long it allows for filter#destroy I think. Jetty has a setting for this if I recall correctly, but not finding tomcats version.
          Hide
          Erick Erickson added a comment -

          May be related, adding link just in case.

          Show
          Erick Erickson added a comment - May be related, adding link just in case.
          Hide
          Mark Miller added a comment -

          Markus Jelsma, if you see this again, could you dump the shutdown logs?

          Show
          Mark Miller added a comment - Markus Jelsma , if you see this again, could you dump the shutdown logs?
          Hide
          Markus Jelsma added a comment -

          Yes of course! Btw, the last time i saw the error, we were already using Jetty (Solr 5.1).

          Show
          Markus Jelsma added a comment - Yes of course! Btw, the last time i saw the error, we were already using Jetty (Solr 5.1).

            People

            • Assignee:
              Unassigned
              Reporter:
              Markus Jelsma
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:

                Development