Kafka
  1. Kafka
  2. KAFKA-686

0.8 Kafka broker should give a better error message when running against 0.7 zookeeper

    Details

    • Type: Bug Bug
    • Status: Patch Available
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.8.3
    • Component/s: None
    • Labels:

      Description

      People will not know that the zookeeper paths are not compatible. When you try to start the 0.8 broker pointed at a 0.7 zookeeper you get a NullPointerException. We should detect this and give a more sane error.

      Error:
      kafka.common.KafkaException: Can't parse json string: null
      at kafka.utils.Json$.liftedTree1$1(Json.scala:20)
      at kafka.utils.Json$.parseFull(Json.scala:16)
      at kafka.utils.ZkUtils$$anonfun$getReplicaAssignmentForTopics$2.apply(ZkUtils.scala:498)
      at kafka.utils.ZkUtils$$anonfun$getReplicaAssignmentForTopics$2.apply(ZkUtils.scala:494)
      at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:61)
      at scala.collection.immutable.List.foreach(List.scala:45)
      at kafka.utils.ZkUtils$.getReplicaAssignmentForTopics(ZkUtils.scala:494)
      at kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:446)
      at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:220)
      at kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:85)
      at kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:53)
      at kafka.server.ZookeeperLeaderElector.startup(ZookeeperLeaderElector.scala:43)
      at kafka.controller.KafkaController.startup(KafkaController.scala:381)
      at kafka.server.KafkaServer.startup(KafkaServer.scala:90)
      at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:34)
      at kafka.Kafka$.main(Kafka.scala:46)
      at kafka.Kafka.main(Kafka.scala)
      Caused by: java.lang.NullPointerException
      at scala.util.parsing.combinator.lexical.Scanners$Scanner.<init>(Scanners.scala:52)
      at scala.util.parsing.json.JSON$.parseRaw(JSON.scala:71)
      at scala.util.parsing.json.JSON$.parseFull(JSON.scala:85)
      at kafka.utils.Json$.liftedTree1$1(Json.scala:17)
      ... 16 more

      1. KAFAK-686-null-pointer-fix.patch
        0.8 kB
        Phil Hargett
      2. KAFKA-686-null-pointer-fix-2.patch
        3 kB
        Viktor Taranenko

        Activity

        Jay Kreps created issue -
        Phil Hargett made changes -
        Field Original Value New Value
        Status Open [ 1 ] Patch Available [ 10002 ]
        Affects Version/s 0.8 [ 12317244 ]
        Labels newbie patch
        Phil Hargett made changes -
        Comment [ While this doesn't produce a message clarifying the error, it does prevent the error in the first place. I am a novice at Scala, but I think there is a bug in ZkClient that was wrapping a null value in a Some instance--which caused the Json.parseFull to fail, as it received a null string to parse.

        This patch attempts to detect the null from Zookeeper and return a tuple with None instead of Some(null). ]
        Hide
        Phil Hargett added a comment - - edited

        While this doesn't produce a message clarifying the error, it does prevent the error in the first place. I am a novice at Scala, but I think there is a bug in ZkClient that was wrapping a null value in a Some instance--which caused the Json.parseFull to fail, as it received a null string to parse.

        This patch attempts to detect the null from Zookeeper and return a tuple with None instead of Some(null).

        Patch here using git diff

        diff --git a/core/src/main/scala/kafka/utils/ZkUtils.scala b/core/src/main/scala/kafka/utils/ZkUtils.scala
        index c6119d9..90ee2d0 100644
        — a/core/src/main/scala/kafka/utils/ZkUtils.scala
        +++ b/core/src/main/scala/kafka/utils/ZkUtils.scala
        @@ -419,7 +419,12 @@ object ZkUtils extends Logging {
        def readDataMaybeNull(client: ZkClient, path: String): (Option[String], Stat) = {
        val stat: Stat = new Stat()
        val dataAndStat = try {

        • (Some(client.readData(path, stat)), stat)
          + val data: String = client.readData(path,stat)
          + if(data != null) { + (Some(client.readData(path, stat)), stat) + }

          else

          { + (None,stat) + }

          } catch {
          case e: ZkNoNodeException =>
          (None, stat)

        Show
        Phil Hargett added a comment - - edited While this doesn't produce a message clarifying the error, it does prevent the error in the first place. I am a novice at Scala, but I think there is a bug in ZkClient that was wrapping a null value in a Some instance--which caused the Json.parseFull to fail, as it received a null string to parse. This patch attempts to detect the null from Zookeeper and return a tuple with None instead of Some(null). Patch here using git diff diff --git a/core/src/main/scala/kafka/utils/ZkUtils.scala b/core/src/main/scala/kafka/utils/ZkUtils.scala index c6119d9..90ee2d0 100644 — a/core/src/main/scala/kafka/utils/ZkUtils.scala +++ b/core/src/main/scala/kafka/utils/ZkUtils.scala @@ -419,7 +419,12 @@ object ZkUtils extends Logging { def readDataMaybeNull(client: ZkClient, path: String): (Option [String] , Stat) = { val stat: Stat = new Stat() val dataAndStat = try { (Some(client.readData(path, stat)), stat) + val data: String = client.readData(path,stat) + if(data != null) { + (Some(client.readData(path, stat)), stat) + } else { + (None,stat) + } } catch { case e: ZkNoNodeException => (None, stat)
        Neha Narkhede made changes -
        Priority Major [ 3 ] Blocker [ 1 ]
        Hide
        Phil Hargett added a comment -

        Here's the diff as an attachment, for easier usage.

        Show
        Phil Hargett added a comment - Here's the diff as an attachment, for easier usage.
        Phil Hargett made changes -
        Attachment KAFAK-686-null-pointer-fix.patch [ 12573105 ]
        Hide
        Swapnil Ghike added a comment -

        How do we differentiate whether the 0.8 broker is pointing at 0.7 zookeeper data, or the zookeeper data for 0.8 got messed up (using zkclient etc) by mistake? Perhaps the right solution is to rename the zookeeper directory for 0.8.

        Show
        Swapnil Ghike added a comment - How do we differentiate whether the 0.8 broker is pointing at 0.7 zookeeper data, or the zookeeper data for 0.8 got messed up (using zkclient etc) by mistake? Perhaps the right solution is to rename the zookeeper directory for 0.8.
        Hide
        Phil Hargett added a comment -

        While, yes, the original issue is unaddressed, there is a lurking bug in ZkUtils: a null pointer is getting wrapped in Some when it shouldn't--it's misleading. The callers of this ZkUtils method even expect that None could be returned, so there is no need to mishandle null. To be fair, the method could probably be named readDataMaybeNone, but who am I to quibble.

        It would seem like a higher-order check for 0.8 vs. 0.7 Zookeeper data is in order. That is, perhaps on startup Kafka should check the state of Zookeeper data, and report an error/warning at that time. Low-level APIs like this shouldn't necessarily have to worry about version mismatch; it could be expensive if they did. Relying upon a misleading exception (NullPointerException) rather than something that specifically diagnoses the issue would seem an incorrect choice.

        That's my $.02 after having only spent ~2 days in this code, but hope this helps.

        Show
        Phil Hargett added a comment - While, yes, the original issue is unaddressed, there is a lurking bug in ZkUtils: a null pointer is getting wrapped in Some when it shouldn't--it's misleading. The callers of this ZkUtils method even expect that None could be returned, so there is no need to mishandle null. To be fair, the method could probably be named readDataMaybeNone, but who am I to quibble. It would seem like a higher-order check for 0.8 vs. 0.7 Zookeeper data is in order. That is, perhaps on startup Kafka should check the state of Zookeeper data, and report an error/warning at that time. Low-level APIs like this shouldn't necessarily have to worry about version mismatch; it could be expensive if they did. Relying upon a misleading exception (NullPointerException) rather than something that specifically diagnoses the issue would seem an incorrect choice. That's my $.02 after having only spent ~2 days in this code, but hope this helps.
        Joe Stein made changes -
        Fix Version/s 0.8.1 [ 12322960 ]
        Fix Version/s 0.8 [ 12317244 ]
        Neha Narkhede made changes -
        Fix Version/s 0.8.2 [ 12326167 ]
        Fix Version/s 0.8.1 [ 12322960 ]
        Hide
        Viktor Taranenko added a comment -

        Hi, there is another patch to fix NullPointerException

        In Scala

        Some(null)

        returns Option container with null value inside, whereas

        Option(null)

        returns None

        In addition I cleared code a bit. I don't really know if you welcome patches with code styling improvements. If so I can do further refactoring in that class. There are big level of indentation, mutable structures and some of duplicate code

        Show
        Viktor Taranenko added a comment - Hi, there is another patch to fix NullPointerException In Scala Some( null ) returns Option container with null value inside, whereas Option( null ) returns None In addition I cleared code a bit. I don't really know if you welcome patches with code styling improvements. If so I can do further refactoring in that class. There are big level of indentation, mutable structures and some of duplicate code
        Viktor Taranenko made changes -
        Attachment KAFKA-686-null-pointer-fix-2.patch [ 12662323 ]
        Hide
        Guozhang Wang added a comment -

        Thanks Phil Hargett and Viktor Taranenko for your patches.

        A general thought is that we may probably want to go one step further to avoid the brokers to be unnecessarily fallen into the bad state. One example I have seen before is that a 0.7 consumer mistakenly writes to a 0.8 Zookeeper may cause the controller to fail and the whole cluster in a bad state. To be more specific, according to the ZK data structure:

        https://cwiki.apache.org/confluence/display/KAFKA/Kafka+data+structures+in+Zookeeper

        1. The controller would log the error and "exclude" the information when encounter error in parsing a) consumer offsets, b) admin paths since these data can be written by clients or admin tools.
        2. The controller would fail gracefully (i.e. log the error and shutdown the server itself) when encounter error in parsing a) broker registration info, b) controller epoch, c) controller registration, d) topic registration info, and e) partition state info, since these data can only be written by the brokers themselves.
        3. The broker should fail gracefully when encounter error in parsing a) partition state info.

        Some other comments:

        1. readDataMaybeNull is used in other places of ZkUtils and other classes, and they may also be modified accordingly.
        2. I am wondering does Exception.failAsValue exist in scala version 2.8/9 also? We are planning to retire scala 2.8 support but have not finished yet.

        Will you have time recently to continue working on this ticket?

        Show
        Guozhang Wang added a comment - Thanks Phil Hargett and Viktor Taranenko for your patches. A general thought is that we may probably want to go one step further to avoid the brokers to be unnecessarily fallen into the bad state. One example I have seen before is that a 0.7 consumer mistakenly writes to a 0.8 Zookeeper may cause the controller to fail and the whole cluster in a bad state. To be more specific, according to the ZK data structure: https://cwiki.apache.org/confluence/display/KAFKA/Kafka+data+structures+in+Zookeeper 1. The controller would log the error and "exclude" the information when encounter error in parsing a) consumer offsets, b) admin paths since these data can be written by clients or admin tools. 2. The controller would fail gracefully (i.e. log the error and shutdown the server itself) when encounter error in parsing a) broker registration info, b) controller epoch, c) controller registration, d) topic registration info, and e) partition state info, since these data can only be written by the brokers themselves. 3. The broker should fail gracefully when encounter error in parsing a) partition state info. Some other comments: 1. readDataMaybeNull is used in other places of ZkUtils and other classes, and they may also be modified accordingly. 2. I am wondering does Exception.failAsValue exist in scala version 2.8/9 also? We are planning to retire scala 2.8 support but have not finished yet. Will you have time recently to continue working on this ticket?
        Hide
        Guozhang Wang added a comment -

        This is a better-to-have in 0.8.2, if Phil Hargett and Viktor Taranenko do not have time now we can push to 0.9.

        Show
        Guozhang Wang added a comment - This is a better-to-have in 0.8.2, if Phil Hargett and Viktor Taranenko do not have time now we can push to 0.9.
        Hide
        Viktor Taranenko added a comment -

        Guozhang Wang I'm happy to put some effort there over this weekend. When do you guys want to release 0.8.2?

        I see there is no 'Exception.failAsValue' exists in Scala 2.8. I can make it compatible if required

        Show
        Viktor Taranenko added a comment - Guozhang Wang I'm happy to put some effort there over this weekend. When do you guys want to release 0.8.2? I see there is no 'Exception.failAsValue' exists in Scala 2.8. I can make it compatible if required
        Hide
        Joe Stein added a comment -

        0.8.2 drops support for Scala 2.8

        Show
        Joe Stein added a comment - 0.8.2 drops support for Scala 2.8
        Hide
        Viktor Taranenko added a comment -

        So I can consider 2.9 minimal version while solving this issue?

        Show
        Viktor Taranenko added a comment - So I can consider 2.9 minimal version while solving this issue?
        Hide
        Joe Stein added a comment -

        2.9.1 yes

        Show
        Joe Stein added a comment - 2.9.1 yes
        Hide
        Viktor Taranenko added a comment -

        Guozhang Wang

        I've put initial changeset on reviewboard.

        https://reviews.apache.org/r/25420/diff/

        It contains some code refactoring to centralise getting and parsing data from ZK and implementation of shutting down the controller when it receives invalid topic exception while trying to sync with new topics.

        I would to get initial feedback if that is the behaviour you would expect or I'm missing something.
        If everything is fine so far, I will continue working on it this week.

        Thanks

        Show
        Viktor Taranenko added a comment - Guozhang Wang I've put initial changeset on reviewboard. https://reviews.apache.org/r/25420/diff/ It contains some code refactoring to centralise getting and parsing data from ZK and implementation of shutting down the controller when it receives invalid topic exception while trying to sync with new topics. I would to get initial feedback if that is the behaviour you would expect or I'm missing something. If everything is fine so far, I will continue working on it this week. Thanks
        Jun Rao made changes -
        Priority Blocker [ 1 ] Major [ 3 ]
        Hide
        Neha Narkhede added a comment -

        Viktor Taranenko, took a quick look at the patch. Could you clarify how failAsValue on line 455 in ZkUtils.scala solves the original NPE problem reported in the JIRA?

        Show
        Neha Narkhede added a comment - Viktor Taranenko , took a quick look at the patch. Could you clarify how failAsValue on line 455 in ZkUtils.scala solves the original NPE problem reported in the JIRA?
        Hide
        Viktor Taranenko added a comment -

        Neha Narkhede, It is not 'failAsValue' what solves the problem, but wrapping potential null value into scala Option instead if Some.

        failAsValue only shortens code for catching ZkNoNodeException and returning None of it occurs

        failAsValue(classOf[ZkNoNodeException])(None)(Option(client.readData(path, stat)))
        
        • returns 'None' when client.readData throws ZkNoNodeException (doesn't catch any other kind of Exception from client.readData)
        • returns 'None' when client.readData returns null
        • returns 'Some(value)' when client.readData returns string value

        The original code had `Some(client.readData)`, which returns Some(null), so that for example

        `Some(null).map(_.length)` would throw NullPointerException

        Show
        Viktor Taranenko added a comment - Neha Narkhede , It is not 'failAsValue' what solves the problem, but wrapping potential null value into scala Option instead if Some. failAsValue only shortens code for catching ZkNoNodeException and returning None of it occurs failAsValue(classOf[ZkNoNodeException])(None)(Option(client.readData(path, stat))) returns 'None' when client.readData throws ZkNoNodeException (doesn't catch any other kind of Exception from client.readData) returns 'None' when client.readData returns null returns 'Some(value)' when client.readData returns string value The original code had `Some(client.readData)`, which returns Some(null), so that for example `Some(null).map(_.length)` would throw NullPointerException
        Hide
        Neha Narkhede added a comment - - edited

        Thanks.

        Show
        Neha Narkhede added a comment - - edited Thanks.
        Joe Stein made changes -
        Fix Version/s 0.8.3 [ 12328745 ]
        Fix Version/s 0.8.2 [ 12326167 ]

          People

          • Assignee:
            Unassigned
            Reporter:
            Jay Kreps
          • Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:

              Development