Solr
  1. Solr
  2. SOLR-6583

Resuming connection with ZooKeeper causes log replay

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 4.10.1
    • Fix Version/s: 5.1, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      If a node is partitioned from ZooKeeper for an extended period of time then upon resuming connection, the node re-registers itself causing recoverFromLog() method to be executed which fails with the following exception:

      8091124 [Thread-71] ERROR org.apache.solr.update.UpdateLog  – Error inspecting tlog tlog{file=/home/ubuntu/shalin-lusolr/solr/example/solr/collection_5x3_shard5_replica3/data/tlog/tlog.0000000000000009869 refcount=2}
      java.nio.channels.ClosedChannelException
              at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:99)
              at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:678)
              at org.apache.solr.update.ChannelFastInputStream.readWrappedStream(TransactionLog.java:784)
              at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
              at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:125)
              at java.io.InputStream.read(InputStream.java:101)
              at org.apache.solr.update.TransactionLog.endsWithCommit(TransactionLog.java:218)
              at org.apache.solr.update.UpdateLog.recoverFromLog(UpdateLog.java:800)
              at org.apache.solr.cloud.ZkController.register(ZkController.java:834)
              at org.apache.solr.cloud.ZkController$1.command(ZkController.java:271)
              at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
      8091125 [Thread-71] ERROR org.apache.solr.update.UpdateLog  – Error inspecting tlog tlog{file=/home/ubuntu/shalin-lusolr/solr/example/solr/collection_5x3_shard5_replica3/data/tlog/tlog.0000000000000009870 refcount=2}
      java.nio.channels.ClosedChannelException
              at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:99)
              at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:678)
              at org.apache.solr.update.ChannelFastInputStream.readWrappedStream(TransactionLog.java:784)
              at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
              at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:125)
              at java.io.InputStream.read(InputStream.java:101)
              at org.apache.solr.update.TransactionLog.endsWithCommit(TransactionLog.java:218)
              at org.apache.solr.update.UpdateLog.recoverFromLog(UpdateLog.java:800)
              at org.apache.solr.cloud.ZkController.register(ZkController.java:834)
              at org.apache.solr.cloud.ZkController$1.command(ZkController.java:271)
              at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
      

      This is because the recoverFromLog uses transaction log references that were collected at startup and are no longer valid.

      We shouldn't even be running recoverFromLog code for ZK re-connect.

        Issue Links

          Activity

          Hide
          Jesse Sipprell added a comment -

          Just an FYI, I also am seeing a number of these when bringing all nodes up on a previously shutdown SolrCloud with a populated zk tree, although this hardly is unexpected given the hypothesized cause.

          It does seem fairly minor based on the fact that I haven't noticed any ramifications regarding elections or otherwise bringing everything up irrespective of previous state, but I thought I would add Yet Another Data Point.

          Show
          Jesse Sipprell added a comment - Just an FYI, I also am seeing a number of these when bringing all nodes up on a previously shutdown SolrCloud with a populated zk tree, although this hardly is unexpected given the hypothesized cause. It does seem fairly minor based on the fact that I haven't noticed any ramifications regarding elections or otherwise bringing everything up irrespective of previous state, but I thought I would add Yet Another Data Point.
          Hide
          Shalin Shekhar Mangar added a comment - - edited

          Hi Jesse Sipprell, what you are seeing is probably a different bug because if you are bringing up dead nodes then this error shouldn't happen.

          As I said in the description:

          This is because the recoverFromLog uses transaction log references that were collected at startup and are no longer valid.

          If you start a node then the log references collected at startup should be valid and the recoverFromLog method should definitely succeed. Which version of Solr are you using? How often do you see this error and is it easily reproducible?

          Show
          Shalin Shekhar Mangar added a comment - - edited Hi Jesse Sipprell , what you are seeing is probably a different bug because if you are bringing up dead nodes then this error shouldn't happen. As I said in the description: This is because the recoverFromLog uses transaction log references that were collected at startup and are no longer valid. If you start a node then the log references collected at startup should be valid and the recoverFromLog method should definitely succeed. Which version of Solr are you using? How often do you see this error and is it easily reproducible?
          Hide
          Yago Riveiro added a comment -

          This is happening in Solr 4.6.1 too.

          ERROR - app2 - 2014-12-01 21:30:42.820; org.apache.solr.update.UpdateLog; Error inspecting tlog tlog{file=/solr/node/collections/collection1_shard2_replica1/data/tlog/tlog.0000000000000001284 refcount=2}
          
          Show
          Yago Riveiro added a comment - This is happening in Solr 4.6.1 too. ERROR - app2 - 2014-12-01 21:30:42.820; org.apache.solr.update.UpdateLog; Error inspecting tlog tlog{file=/solr/node/collections/collection1_shard2_replica1/data/tlog/tlog.0000000000000001284 refcount=2}
          Hide
          Markus Jelsma added a comment -
          Show
          Markus Jelsma added a comment - Seems this is on 4.10.4 as well. I have never seen it happen before http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201503.mbox/%3Czarafa.54fc424e.0fe2.1b6ac4ae3e74e461@mail.openindex.io%3E
          Hide
          Mark Miller added a comment -

          I linked to SOLR-7338, as that is fairly related to this issue and my sample patch attached there should address this.

          Show
          Mark Miller added a comment - I linked to SOLR-7338 , as that is fairly related to this issue and my sample patch attached there should address this.
          Hide
          ASF subversion and git services added a comment -

          Commit 1671554 from Timothy Potter in branch 'dev/trunk'
          [ https://svn.apache.org/r1671554 ]

          SOLR-7338: A reloaded core will never register itself as active after a ZK session expiration, also fixes SOLR-6583

          Show
          ASF subversion and git services added a comment - Commit 1671554 from Timothy Potter in branch 'dev/trunk' [ https://svn.apache.org/r1671554 ] SOLR-7338 : A reloaded core will never register itself as active after a ZK session expiration, also fixes SOLR-6583
          Hide
          ASF subversion and git services added a comment -

          Commit 1671562 from Timothy Potter in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1671562 ]

          SOLR-7338: A reloaded core will never register itself as active after a ZK session expiration, also fixes SOLR-6583

          Show
          ASF subversion and git services added a comment - Commit 1671562 from Timothy Potter in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1671562 ] SOLR-7338 : A reloaded core will never register itself as active after a ZK session expiration, also fixes SOLR-6583
          Hide
          ASF subversion and git services added a comment -

          Commit 1671570 from Timothy Potter in branch 'dev/branches/lucene_solr_5_1'
          [ https://svn.apache.org/r1671570 ]

          SOLR-7338: A reloaded core will never register itself as active after a ZK session expiration, also fixes SOLR-6583

          Show
          ASF subversion and git services added a comment - Commit 1671570 from Timothy Potter in branch 'dev/branches/lucene_solr_5_1' [ https://svn.apache.org/r1671570 ] SOLR-7338 : A reloaded core will never register itself as active after a ZK session expiration, also fixes SOLR-6583
          Hide
          Shalin Shekhar Mangar added a comment -

          This was fixed by SOLR-7338

          Show
          Shalin Shekhar Mangar added a comment - This was fixed by SOLR-7338

            People

            • Assignee:
              Shalin Shekhar Mangar
              Reporter:
              Shalin Shekhar Mangar
            • Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development