Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-1398

Topic config changes can be lost and cause fatal exceptions on broker restarts

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Fixed
    • 0.8.1
    • 0.8.1.1
    • None
    • None

    Description

      Our topic config cleanup policy seems to be broken. When a broker is
      bounced and starting up:
      1 - Read all the children of the config change path
      2 - For each, if the change id is greater than the last executed change,
      then extract the topic information.
      3 - If there is a log for that topic on this broker, then apply the change.
      However, if there is no log, then delete the config change.

      In step 3, a delete triggers a child change watch firing on all the other
      brokers. The other brokers currently take all the children of the config
      path but will ignore those config changes that are less than the last
      executed change. At least one issue here is that if a broker does not have
      partitions for a topic then the lastExecutedChange is not updated (for
      that topic).

      Consider this scenario:

      • Three brokers 0, 1, 2
      • Topic A has partitions only assigned to broker 0
      • Topic B has partitions only assigned to broker 1
      • Topic C has partitions only assigned to broker 2
      • Change 0: topic A
      • Change 1: topic B
      • Change 2: topic C
      • lastExecutedChange on broker 0 is 0
      • lastExecutedChange on broker 1 is 1
      • lastExecutedChange on broker 2 is 2
      • Bounce broker 1
      • The above bounce will cause Change 0 and Change 2 to get deleted.
      • Watch fires on broker 0 and 1
      • Broker 0 will try and read the topic corresponding to change 1 (since its
        lastExecutedChange is 0) and then change 2. That read will fail:

      2014/04/15 19:35:34.236 INFO [TopicConfigManager] [main] [kafka-server] [] Processed topic config change 25 for topic xyz, setting new config to

      {retention.ms=3600000, segment.ms=3600000}

      .
      2014/04/15 19:35:34.238 FATAL [KafkaServerStartable] [main] [kafka-server] [] Fatal error during KafkaServerStable startup. Prepare to shutdown
      org.I0Itec.zkclient.exception.ZkNoNodeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /config/changes/config_change_0000000026
      at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:47)
      at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:685)
      at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:766)
      at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:761)
      at kafka.utils.ZkUtils$.readData(ZkUtils.scala:467)
      at kafka.server.TopicConfigManager$$anonfun$kafka$server$TopicConfigManager$$processConfigChanges$2.apply(TopicConfigManager.scala:97)
      at kafka.server.TopicConfigManager$$anonfun$kafka$server$TopicConfigManager$$processConfigChanges$2.apply(TopicConfigManager.scala:93)
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:57)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:43)
      at kafka.server.TopicConfigManager.kafka$server$TopicConfigManager$$processConfigChanges(TopicConfigManager.scala:93)
      at kafka.server.TopicConfigManager.processAllConfigChanges(TopicConfigManager.scala:81)
      at kafka.server.TopicConfigManager.startup(TopicConfigManager.scala:72)
      at kafka.server.KafkaServer.startup(KafkaServer.scala:104)
      at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:34)
      ...
      Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /config/changes/config_change_0000000026
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
      at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
      at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:956)
      at org.I0Itec.zkclient.ZkConnection.readData(ZkConnection.java:103)
      at org.I0Itec.zkclient.ZkClient$9.call(ZkClient.java:770)
      at org.I0Itec.zkclient.ZkClient$9.call(ZkClient.java:766)
      at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
      ... 39 more

      Another issue is that there are two logging statements with incorrect
      qualifiers which makes things a little harder to debug. E.g.,

      2014/04/15 19:35:34.223 ERROR [TopicConfigManager] [kafka-server] [] Ignoring topic config change %d for topic %s since the change has expired

      Attachments

        1. KAFKA-1398.patch
          6 kB
          Jay Kreps
        2. KAFKA-1398.patch
          7 kB
          Jay Kreps
        3. KAFKA-1398.patch
          7 kB
          Jay Kreps
        4. KAFKA-1398_2014-04-18_13:03:03.patch
          3 kB
          Jay Kreps

        Issue Links

          Activity

            People

              jkreps Jay Kreps
              jjkoshy Joel Jacob Koshy
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: