Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Review the logs and do the following:
      1. Fix confusing or duplicative messages
      2. Assess whether messages are at the right level (TRACE/DEBUG/INFO/WARN/ERROR)

      It would also be nice to add a log4j logger for the request logging (i.e. the access log) and another for the controller state change log, since these really have their own use cases.

      1. kafka-649_extra.patch
        7 kB
        Jun Rao
      2. kafka-649.patch
        13 kB
        Jun Rao
      3. KAFKA-649.v3.patch
        33 kB
        Guozhang Wang
      4. KAFKA-649.v4.patch
        6 kB
        Guozhang Wang
      5. KAFKA-649.v5.patch
        6 kB
        Guozhang Wang
      6. KAFKA-649.v6.patch
        5 kB
        Guozhang Wang

        Activity

        Jay Kreps added a comment -

        Closing as it looks like this was checked in.

        Jun Rao added a comment -

        Thanks for patch v6. Committed to 0.8.

        Guozhang Wang added a comment -

        Thanks for the comments.

        50,51,52: Done.

        Jun Rao added a comment -

        Thanks for patch v5.

        50. ConsoleConsumer: There is no need to change the error to fatal since it indicates a command line error.

        51. KafkaApis: Not sure if we need to make this change. If there is an error in getting metadata, the client will get an error code. Logging this in the broker may not help and can pollute the log.

        52. SimpleConsumer: Just logging the message of the exception is enough. It should include the type of the exception in the message.

        Guozhang Wang added a comment -

        Fixed a mistyped left bracket.

        Guozhang Wang added a comment -

        Thanks for the comments Jun.

        30,31,32,33. Done.

        34. OK

        Jun Rao added a comment -

        Thanks for patch v3.

        30. KafkaApi.readMessageSets(): We don't know how we hit the exception, so a stack trace could be useful.

        31. BoundedByteBufferReceive.byteBufferAllocate(): This is the case where we don't know who the caller is, so a stack trace could be useful.

        32. MirrorMakerThread: It's probably better to prefix the log messages with the thread name.

        33. SimpleConsumer: We should log the message of the exception. We just don't need to show the stack trace.

        34. About capitalizing the first letters: this is a good change. However, it should probably be done in trunk, since trunk is where we want to do non-trivial code cleanup. Could you file a separate jira to track that?

        Guozhang Wang added a comment -

        Guidelines for the cleanup:

        1. Only errors causing the server to exit should be FATAL.
        2. Otherwise, only errors that will cause message loss or corruption should be ERROR; everything else should be WARN.
        3. Only log the stack trace if it's not clear how the exception is hit.
        4. Capitalize the first letter of log messages.
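
        The four guidelines can be illustrated with a hypothetical handler. Here `error`/`warn`/`fatal` stand in for the methods of Kafka's kafka.utils.Logging trait, and `appendToLog` is a placeholder operation, not actual Kafka code:

```scala
try {
  appendToLog(messages) // placeholder operation
} catch {
  case e: java.io.IOException =>
    // Rule 2: messages may be lost or corrupted, so ERROR.
    // Rule 3: it is not clear how we got here, so include the stack trace.
    error("Failed to append messages to the log", e)
  case e: IllegalStateException =>
    // Rule 2: recoverable and no data at risk, so WARN. Rule 3: the cause
    // is clear from the message, so no stack trace. Rule 4: capitalized.
    warn("Skipping append since the log is closing: " + e.getMessage)
}
// Rule 1: FATAL is reserved for errors right before the server exits, e.g.
//   fatal("Unrecoverable error on startup", e); System.exit(1)
```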

        Guozhang Wang added a comment -

        Reopened for some more cleaning.

        Jay Kreps added a comment -

        Looks fixed.

        Jun Rao added a comment -

        Thanks for the review. Addressed the comment and committed the extra patch to 0.8.

        Neha Narkhede added a comment -

        Actually, there is another error message that needs to be fixed in KafkaApis -

        In handleTopicMetadataRequest() -

        error("Error while fetching metadata for partition %s".format(topicAndPartition), e)

        Here, we should special-case LeaderNotAvailable and ReplicaNotAvailable, and only log an error for anything other than those. This will reduce the log pollution due to metadata requests during leader failover.
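
        A minimal sketch of that special-casing. The two exception classes are the Kafka 0.8 ones the comment refers to, but the surrounding handler and the `getTopicMetadata` helper are illustrative, not the actual KafkaApis code:

```scala
import kafka.common.{LeaderNotAvailableException, ReplicaNotAvailableException}

try {
  val metadata = getTopicMetadata(topicAndPartition) // hypothetical helper
} catch {
  case e @ (_: LeaderNotAvailableException | _: ReplicaNotAvailableException) =>
    // Expected during leader failover: the client still receives the error
    // code in the response, so keep the broker log quiet.
    debug("Metadata temporarily unavailable for partition %s".format(topicAndPartition))
  case e: Throwable =>
    // Anything else is unexpected and worth an ERROR with the stack trace.
    error("Error while fetching metadata for partition %s".format(topicAndPartition), e)
}
```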

        Neha Narkhede added a comment -

        +1. Just one comment -

        KafkaApis: Can we please change "fails" to "failed"?

        Jun Rao added a comment -

        Attached an extra patch that fixes some more logging issues that I saw. It also addresses issues identified by Neha that hadn't already been fixed. I didn't address the issue that Swapnil brought up: the logging is already at WARN and I can't think of a better logging message.

        Swapnil Ghike added a comment -

        In DefaultEventHandler, we have a few warn("Failed to send ..") calls. Perhaps we should change them to warn("Need to resend ..")?

        Neha Narkhede added a comment -

        These are good changes. A few more suggestions -

        1. A pattern that needs change is "topic %s partition %d". We need to standardize on "[%s,%d]" so that grepping through the logs for a particular topic/partition is easy.

        2. In OfflinePartitionLeaderSelector.selectLeader(),

        warn("No broker in ISR is alive for %s. Elect leader from broker %s. There's potential data loss."
        .format(topicAndPartition, newLeader))
        needs to change to
        warn("No broker in ISR is alive for %s. Elect leader %d from brokers %s. There's potential data loss."
        .format(topicAndPartition, newLeader, liveAssignedReplicasToThisPartition.mkString(",")))

        3. Similarly, change
        debug("Some broker in ISR is alive for %s. Select %d from ISR to be the leader.".format(topicAndPartition, newLeader))
        to
        debug("Some broker in ISR is alive for %s. Select %d from ISR %s to be the leader.".format(topicAndPartition, newLeader, liveBrokersInIsr.mkString(",")))

        4. KafkaApis
        As mentioned in #1, we want to move away from "topic %s partition %d" -
        error("Error when processing fetch request for topic %s partition %d offset %d from %s with correlation id %d"
        .format(topic, partition, offset, if (isFetchFromFollower) "follower" else "consumer", fetchRequest.correlationId),
        t)
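
        For suggestion 1, the grep-friendly "[%s,%d]" format is what a TopicAndPartition-style toString can produce. A self-contained sketch of the idea (the class definition here is illustrative, not the actual kafka.common.TopicAndPartition source):

```scala
// A case class whose toString renders as "[topic,partition]" so that every
// log line uses the same grep-friendly format.
case class TopicAndPartition(topic: String, partition: Int) {
  override def toString = "[%s,%d]".format(topic, partition)
}

val tp = TopicAndPartition("my-topic", 0)
// Logging via toString instead of "topic %s partition %d":
println("Error when processing fetch request for %s".format(tp))
// prints "Error when processing fetch request for [my-topic,0]"
```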

        Jun Rao added a comment -

        Thanks for the review. Committed to 0.8 after addressing all issues except #2. ISR is used in too many places. It's probably better to clean it up in trunk. Otherwise, merging to trunk will be hard.

        These are issues that I found from the log during a typical deployment of Kafka. Will keep it open for some time in case new critical ones come up.

        Jay Kreps added a comment -

        +1
        A few minor issues:
        1. Typo: "No broker is ISR is alive"
        2. Can we spell out ISR. "In-sync replica set" is a little more wordy but it is immediately obvious what it means. My overall concern is that our logging is impossible to read for anyone but a Kafka engineer.
        3. correlationId => correlation id
        4. controllerEpoch => controller epoch
        5. zkPath => zookeeper path

        All of these fixes look good. Do you feel after this patch the logs are clean enough to be ready for the official 0.8 release or is this just one installment?

        Jun Rao added a comment -

        A lot of the logging has already been fixed in other patches. Made another pass and attached a patch. It fixes the following:

        1. On startup, we currently log the loading of every data and index segment. That's probably too much. Instead, only log the logEndOffset after loading the log.

        2. Added topic/partition info in a few places where this is useful.

        3. Added the type of request in a few places in KafkaApis to make it clear which request the error is for.

        4. Moved a few less important log messages from INFO to DEBUG.

        Neha Narkhede added a comment -

        From KAFKA-696 review -

        Looks like we haven't standardized StopReplicaRequest to use TopicAndPartition instead of a custom tuple for representing partitions. This messes up the log4j logging style, and the toString() API will not print partitions the way the rest of the code does. You can either include it in your patch or leave it for KAFKA-649.

        Jun Rao added a comment -

        I can take a look at this.

        Neha Narkhede added a comment -

        Jay, would you like to take this on?

        Jay Kreps added a comment -

        Hey guys, I think this is really just a basic thing we have to do prior to a 0.8 release, it won't take long. Without this the software doesn't feel very polished because the log is completely unintelligible.


          People

          • Assignee:
            Jun Rao
          • Reporter:
            Jay Kreps
          • Votes:
            0
          • Watchers:
            5
