ZooKeeper / ZOOKEEPER-3975

ZooKeeper crashes: Unable to load database on disk java.io.IOException: Unreasonable length

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.6.2
    • Fix Version/s: None
    • Component/s: jute
    • Labels:
      None
    • Environment:

      Debian 10 x64

      openjdk version "11.0.8" 2020-07-14
      OpenJDK Runtime Environment (build 11.0.8+10-post-Debian-1deb10u1)
      OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Debian-1deb10u1, mixed mode, sharing)

      Description

      After running for a while, the entire cluster (3 ZooKeeper servers) crashes suddenly, with every server logging:

       

      2020-10-16 10:37:00,459 [myid:2] - WARN [NIOWorkerThread-4:NIOServerCnxn@373] - Close of session 0x0
      java.io.IOException: ZooKeeperServer not running
              at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
              at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
              at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
              at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
              at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
              at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
              at java.base/java.lang.Thread.run(Thread.java:834)
      2020-10-16 10:37:00,475 [myid:2] - ERROR [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1139] - Unable to load database on disk
      java.io.IOException: Unreasonable length = 5089607
              at org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:166)
              at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:127)
              at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:159)
              at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:768)
              at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:352)
              at org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:258)
              at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:303)
              at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:285)
              at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1093)
              at org.apache.zookeeper.server.quorum.QuorumPeer.getLastLoggedZxid(QuorumPeer.java:1249)
              at org.apache.zookeeper.server.quorum.FastLeaderElection.getInitLastLoggedZxid(FastLeaderElection.java:868)
              at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:941)
              at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1428)

      The "corrupted" file apparently appears on all the servers, so the usual workaround of removing version-2 on the faulty server and letting it replicate from a healthy one is not an option.

      The entire cluster goes down, and we have had downtime every single day since we upgraded from 3.4.9.
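
      For context, the "Unreasonable length" message comes from a size check applied to a length-prefixed record while the transaction log is read. The sketch below illustrates that kind of check; it is not the ZooKeeper source. The class LengthCheckSketch and its helpers are hypothetical names, and the limit used is the documented default of the jute.maxbuffer system property (0xfffff bytes, just under 1 MiB). Under that assumption, the reported length of 5089607 bytes (about 4.85 MiB) exceeds the limit and is rejected. Whether the record is genuinely corrupt or simply larger than the configured limit is not established by this report.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative sketch only; this is NOT ZooKeeper's BinaryInputArchive source.
class LengthCheckSketch {
    // Documented default of the jute.maxbuffer system property (just under 1 MiB).
    static final int DEFAULT_MAX_BUFFER = 0xfffff;

    // Hypothetical helper: reject a length prefix that is negative or above the limit.
    static void checkLength(int len, int maxBuffer) throws IOException {
        if (len < 0 || len > maxBuffer) {
            throw new IOException("Unreasonable length = " + len);
        }
    }

    // Hypothetical helper: read one record stored as a 4-byte length followed by the payload.
    static byte[] readBuffer(DataInputStream in, int maxBuffer) throws IOException {
        int len = in.readInt();
        checkLength(len, maxBuffer); // fails before any payload bytes are read
        byte[] payload = new byte[len];
        in.readFully(payload);
        return payload;
    }

    public static void main(String[] args) throws IOException {
        // Build a stream whose length prefix matches the value from the log.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(5089607);
        out.flush();

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        try {
            readBuffer(in, DEFAULT_MAX_BUFFER);
        } catch (IOException e) {
            System.out.println(e.getMessage()); // prints "Unreasonable length = 5089607"
        }
    }
}
```

      Because the check runs on the length prefix alone, a single oversized or misread prefix is enough to abort loading of the whole log, which matches the failure at startup shown in the trace.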

       

            People

            • Assignee:
              Unassigned
              Reporter:
              tanisdlj Diego Lucas Jiménez