KAFKA-998: Producer should not retry on non-recoverable error codes

    Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.8.0, 0.8.1
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Based on a discussion with Guozhang. The producer currently retries on all error codes (including MessageSizeTooLarge, which is pointless to retry on). This can slow down the producer unnecessarily.

      If we do want to retry on that error code, we would need to retry with a smaller batch size; but that's a separate discussion.

      Attachments

      1. KAFKA-998.v1.patch (10 kB), Guozhang Wang

        Activity

        Guozhang Wang added a comment -

        Approach proposal:

        1. Pass the errorCode from send to dispatchSerializedData to handle along with the outstandingProduceRequests.

        2. When outstandingProduceRequests.size > 0, check the corresponding error code; if it indicates a non-recoverable error such as MessageSizeTooLarge, then break the while loop immediately.

        By doing so it would also be easier in the future to make a smarter producer client by dynamically shrinking the batch size.
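A rough sketch of the retry-loop change described above (Java rather than the actual Scala producer code; the class name, method shape, and error-code values here are illustrative, not Kafka's real ErrorMapping constants):

```java
import java.util.List;

class RetryLoopSketch {
    // Illustrative stand-in for Kafka's MessageSizeTooLarge error code.
    static final short MESSAGE_SIZE_TOO_LARGE = 10;

    /** Returns the number of retry rounds actually attempted. */
    static int sendWithRetries(List<Short> outstandingErrorCodes, int maxRetries) {
        int attempts = 0;
        while (!outstandingErrorCodes.isEmpty() && attempts < maxRetries) {
            attempts++;
            // A non-recoverable error such as MessageSizeTooLarge can never
            // succeed on retry, so break out of the while loop immediately.
            if (outstandingErrorCodes.contains(MESSAGE_SIZE_TOO_LARGE)) {
                break;
            }
            // ... re-send the outstanding produce requests here (elided) ...
        }
        return attempts;
    }
}
```

The point of the early break is that a batch carrying a non-recoverable error stops consuming retry rounds after the first attempt, instead of burning all of message.send.max.retries.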

        Guozhang Wang added a comment -

        Needs rebase once 0.8 is merged in trunk.

        Guozhang Wang added a comment -

        1. Add a function in ErrorMapping, which takes an error code and decides if this error is fatal. For now only MessageSizeTooLarge is fatal.

        2. If the error code is fatal, set needRetry to false.

        3. If needRetry is false, break the while loop in handle.
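A minimal sketch of steps 1–3 (Java for illustration; the real helper would live in Kafka's ErrorMapping, and the code value below is made up):

```java
class ErrorClassifier {
    // Illustrative code values; the real ones live in kafka.common.ErrorMapping.
    static final short NO_ERROR = 0;
    static final short MESSAGE_SIZE_TOO_LARGE = 10;

    // Step 1: decide whether an error code is fatal, i.e. retrying can never
    // succeed. For now only MessageSizeTooLarge qualifies.
    static boolean isFatal(short errorCode) {
        return errorCode == MESSAGE_SIZE_TOO_LARGE;
    }

    // Steps 2-3: needRetry is false exactly when the error is fatal, and the
    // caller breaks its while loop when needRetry is false.
    static boolean needRetry(short errorCode) {
        return !isFatal(errorCode);
    }
}
```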

        Jason Rosenberg added a comment -

        The caller of Producer.send() should also have the ability to know whether a send failure is recoverable (one that might succeed with more retries). It may otherwise be hard for the client developer to guess the right value for message.send.max.retries, since a transient error (like a restarting broker) can take an unknown amount of time to clear. If I want to implement guaranteed-delivery semantics, the client needs enough information to decide whether to continue retrying a message or give up.

        This could be done by having Producer.send() throw different exception types (e.g. different versions of FailedToSendMessageException), e.g. UnrecoverableFailedToSendException or RetriesExhaustedFailedToSendException (perhaps shorter names for these exceptions). These could both be sub-classes of FailedToSendException.

        Another approach might be to have the FailedToSendException return information, such as how many retries were attempted, whether or not the message might be recoverable with more retries, and it should wrap the root cause, so debugging is possible.
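A hedged sketch of the hierarchy Jason describes (all class names come from the comment above; none of them exist in the 0.8 client as written):

```java
// Hypothetical base class combining both of Jason's suggestions: typed
// subclasses plus retry metadata and a wrapped root cause.
class FailedToSendException extends RuntimeException {
    private final int retriesAttempted;
    private final boolean maybeRecoverable;

    FailedToSendException(String msg, Throwable cause,
                          int retriesAttempted, boolean maybeRecoverable) {
        super(msg, cause);  // wrap the root cause so debugging is possible
        this.retriesAttempted = retriesAttempted;
        this.maybeRecoverable = maybeRecoverable;
    }

    int retriesAttempted() { return retriesAttempted; }
    boolean maybeRecoverable() { return maybeRecoverable; }
}

// Thrown when the error code means retrying can never succeed.
class UnrecoverableFailedToSendException extends FailedToSendException {
    UnrecoverableFailedToSendException(String msg, Throwable cause) {
        super(msg, cause, 0, false);
    }
}

// Thrown when the send might have succeeded with more retries.
class RetriesExhaustedFailedToSendException extends FailedToSendException {
    RetriesExhaustedFailedToSendException(String msg, Throwable cause, int retries) {
        super(msg, cause, retries, true);
    }
}
```

A caller could then catch FailedToSendException, inspect maybeRecoverable(), and keep retrying or give up accordingly.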

        Jason Rosenberg added a comment -

        Also, a producer can throw QueueFullException. From the client's point of view, this should also be a retryable situation (depending on load). Thus QueueFullException might make sense as a sub-class of FailedToSendException (and more likely of RetriesExhaustedFailedToSendException, or whatever that ends up being named).

        Guozhang Wang added a comment -

        Hello Jason,

        I think what you need is to return more information to the caller of Producer.send(), since currently it only throws a FailedToSendMessageException:

        "Failed to send messages after # tries."

        For this case I think it is better for you to create a separate JIRA. As for dynamically adjusting the batch size upon receiving MessageSizeTooLargeException, I will file a separate JIRA for that.

        Jason Rosenberg added a comment -

        Ok, I filed KAFKA-1025 to track the issue for reasoning about whether a failed send should be recoverable.

        Joel Koshy added a comment -

        Apologies for the late review. A couple of comments:

        • I think this could reset needRetry back to false if subsequent partitions in the iteration do need a retry: needRetry = needRetry && !fatalException(topicPartitionAndError._2). The logic is actually a bit confusing. Instead, it might be clearer to just do: failedTopicPartitions.exists(<some entry for which we need to retry>)
        • Can you enhance the logging a bit to indicate that there were fatal sends that will not be retried? e.g., "Dropping messages to topic x due to message size limit.." or something like that.
        • Can you rebase?
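Joel's exists-based suggestion could look roughly like this (Java for illustration; the real code iterates Scala (TopicAndPartition, errorCode) tuples, and the fatal code value here is made up):

```java
import java.util.Map;

class RetryDecision {
    // Illustrative stand-in for the MessageSizeTooLarge error code.
    static final short MESSAGE_SIZE_TOO_LARGE = 10;

    /** Retry iff at least one failed partition carries a retriable error,
     *  instead of folding a needRetry flag across the iteration (which can
     *  be flipped back and forth by later entries). */
    static boolean needRetry(Map<String, Short> failedTopicPartitions) {
        return failedTopicPartitions.values().stream()
                .anyMatch(code -> code != MESSAGE_SIZE_TOO_LARGE);
    }
}
```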
        Guozhang Wang added a comment -

        Thanks for the review, Joel. Do you mean rebase on 0.8 (it was originally on trunk)?

        Joel Koshy added a comment -

        Oh, I thought this was for 0.8 - it does apply on trunk. Do people think this is small and important enough to apply to 0.8?

        Another comment after thinking about the patch: in dispatchSerializedData, would it be better to just drop data that have hit the message size limit? That way there is no need to return needRetry, so the dispatchSerializedData signature remains the same. The disadvantage is that we won't propagate a FailedToSendMessageException for such messages to the caller. For the producer in async mode that is probably fine (right now the caller can't really do much with that exception); in sync mode the caller could perhaps decide to send fewer messages at once. Even then we don't really say which topics/messages hit the message size limit, so I think it is fine in that case as well. Furthermore, this would be covered by KAFKA-1026 to a large degree.

        Neha Narkhede added a comment -

        >> Do people think this is small and important enough to apply to 0.8?

        +1.

        Guozhang, do you mind submitting a reviewboard?

        Jun Rao added a comment -

        Thanks for the patch. Looks good overall. Just one comment:

        10. ErrorMapping.fatalException(): should we rename it to unrecoverableException? MessageSizeTooLarge doesn't seem like a fatal exception.

        I am not sure if it's worth patching this in 0.8. The workaround is to reduce the batch size, as well as reducing the retry count and retry interval.
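Jun's workaround maps onto the 0.8 producer configuration; a sketch with illustrative values (tune batch.num.messages against the broker's message.max.bytes):

```properties
# Keep each batch comfortably under the broker's message size limit.
batch.num.messages=50
# Fewer retries and a short backoff so a non-recoverable send fails fast.
message.send.max.retries=1
retry.backoff.ms=100
```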

        Jason Rosenberg added a comment -

        Jun Rao,

        I'd love to see this in 0.8. For a message too large exception, which gets returned to the producer client currently as a FailedToSendMessageException, it's indistinguishable from any other kind of exception, for which sub-dividing the batch and retrying are not viable options.

        The workaround described is not workable in practice, since the FTSME does not include any root-cause information (a simple causedBy() method might help in that regard)!

        Jun Rao added a comment -

        Jason,

        This patch just won't retry sending the data when hitting a MessageSizeTooLargeException. It doesn't really address your main concern, which is that the caller doesn't know the real cause of the failure. Addressing this issue completely will need some more thought in the producer logic, and the changes required may be non-trivial. So I am not sure if we should do this in 0.8.

        Jason Rosenberg added a comment -

        Jun,

        Yeah, sorry, I forgot I had filed KAFKA-1025 to address my concerns about exposing recoverability.

        Thanks,

        Jason

        Jason Rosenberg added a comment -

        Jun Rao
        Would it be an easier short-term fix to at least set the root cause on the FailedToSendMessageException, so we could see the MessageSizeTooLargeException as the cause of the FTSME? Or is that not easy to do?

        Jun Rao added a comment -

        My feeling is that it may not be very easy to do a quick fix. Currently, the cause exceptions are eaten at several places just so that we can pass back unsuccessfully sent messages.


          People

          • Assignee: Guozhang Wang
          • Reporter: Joel Koshy
          • Votes: 0
          • Watchers: 5