
ReplicaFetcher stopped after non-fatal exception is thrown

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.11.0.2, 1.0.0, 1.0.1, 1.1.0
    • Fix Version/s: None
    • Component/s: replication
    • Labels:
      None

      Description

      We have seen several under-replicated partitions, usually triggered by topic creation. After digging through the logs, we found the following:

      [2018-03-12 22:40:17,641] ERROR [ReplicaFetcher replicaId=12, leaderId=0, fetcherId=1] Error due to (kafka.server.ReplicaFetcherThread)
      kafka.common.KafkaException: Error processing data for partition [[TOPIC_NAME_REMOVED]]-84 offset 2098535
       at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:204)
       at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:169)
       at scala.Option.foreach(Option.scala:257)
       at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:169)
       at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:166)
       at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
       at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
       at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:166)
       at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:166)
       at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:166)
       at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
       at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:164)
       at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
       at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
      Caused by: org.apache.kafka.common.errors.OffsetOutOfRangeException: Cannot increment the log start offset to 2098535 of partition [[TOPIC_NAME_REMOVED]]-84 since it is larger than the high watermark -1
      [2018-03-12 22:40:17,641] INFO [ReplicaFetcher replicaId=12, leaderId=0, fetcherId=1] Stopped (kafka.server.ReplicaFetcherThread)

      It looks like the replicas start to lag behind once the ReplicaFetcherThread has stopped, presumably because we are no longer fetching from the leader. Looking further, this is the run() method in ShutdownableThread.scala:

      override def run(): Unit = {
       info("Starting")
       try {
         while (isRunning)
           doWork()
       } catch {
         case e: FatalExitError =>
           shutdownInitiated.countDown()
           shutdownComplete.countDown()
           info("Stopped")
           Exit.exit(e.statusCode())
         case e: Throwable =>
           if (isRunning)
             error("Error due to", e)
       } finally {
         shutdownComplete.countDown()
       }
       info("Stopped")
      }

      In the Throwable (non-fatal) case, the exception is logged, the while loop exits, and the thread stops doing work. I am not sure whether this is the intended behavior of ShutdownableThread, or whether the exception should be caught inside the loop so that we keep calling doWork().
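To make the second option concrete, here is a minimal sketch of a loop that catches non-fatal exceptions per iteration and keeps calling doWork(). It is not the Kafka implementation and not a proposed patch: `ResilientLoopThread`, `retryBackoffMs`, and `ResilientLoopDemo` are hypothetical names invented for illustration, and whether retrying is actually safe for the replica fetcher (for example, after the OffsetOutOfRangeException above) is an open question this sketch does not answer.

```scala
import java.util.concurrent.CountDownLatch
import java.util.concurrent.atomic.AtomicInteger
import scala.util.control.NonFatal

// Hypothetical stand-in for kafka.utils.ShutdownableThread; NOT the real class.
// Assumption: a failed doWork() can simply be retried after a short pause.
abstract class ResilientLoopThread(name: String, retryBackoffMs: Long = 100L) extends Thread(name) {
  @volatile private var running = true
  private val shutdownComplete = new CountDownLatch(1)

  def doWork(): Unit

  def initiateShutdown(): Unit = {
    running = false
    interrupt() // wake the thread if it is blocked or sleeping
  }

  def awaitShutdown(): Unit = shutdownComplete.await()

  override def run(): Unit = {
    try {
      while (running) {
        try doWork()
        catch {
          // Interrupted because shutdown was requested: let the loop condition end the thread.
          case _: InterruptedException if !running => ()
          // Non-fatal failure: log it and retry instead of leaving the loop.
          case NonFatal(e) =>
            System.err.println(s"[$name] Error in doWork, will retry: $e")
            try Thread.sleep(retryBackoffMs)
            catch { case _: InterruptedException => () }
        }
      }
    } finally shutdownComplete.countDown()
  }
}

object ResilientLoopDemo {
  // Starts a worker whose first two doWork() calls throw, waits for three
  // successful calls, shuts the worker down, and returns the success count.
  def run(): Int = {
    val calls = new AtomicInteger(0)
    val successes = new AtomicInteger(0)
    val done = new CountDownLatch(3)
    val worker = new ResilientLoopThread("demo-fetcher", retryBackoffMs = 10L) {
      override def doWork(): Unit = {
        if (calls.incrementAndGet() <= 2)
          throw new IllegalStateException("simulated non-fatal failure")
        successes.incrementAndGet()
        done.countDown()
        Thread.sleep(5)
      }
    }
    worker.start()
    done.await()
    worker.initiateShutdown()
    worker.awaitShutdown()
    successes.get()
  }
}
```

Calling `ResilientLoopDemo.run()` exercises the loop: the worker survives the two simulated failures and keeps doing work until it is shut down. Fatal errors (anything `NonFatal` does not match, such as `OutOfMemoryError`) still propagate and end the thread, which is a deliberate choice in this sketch.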

       

              People

              • Assignee: Unassigned
              • Reporter: Julio Ng (julion)
              • Votes: 1
              • Watchers: 7

                Dates

                • Created:
                  Updated: