Kafka
  1. Kafka
  2. KAFKA-1025

Producer.send should provide recoverability info on failiure

    Details

    • Type: Bug Bug
    • Status: Patch Available
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.8.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      Currently, in 0.8, the Producer.send() method either succeeds, or fails by throwing an Exception.

      There are several exceptions that can be thrown, including:

      FailedToSendException
      QueueFullException
      ClassCastExeption

      These are all sub-classes of RuntimeException.

      Under the covers, the producer will retry sending messages up to a maximum number of times (according to the message.send.max.retries property). Internally, the producer may decide which sorts of failures are recoverable, and will retry those. Alternatively (via an upcoming change, see KAFKA-998), it may decide to not retry at all, if the error is not recoverable.

      The problem is, if FailedToSendException is returned, the caller to Producer.send doesn't have a way to decide if a send failed due to an unrecoverable error, or failed after exhausting a maximum number of retries.

      A caller may want to decide to retry more times, perhaps after waiting a while. But it should know first whether it's even likely that the failure is retryable.

      An example of this might be a if the message size is too large (represented internally as a MessageSizeTooLargeException). In this case, it is not recoverable, but it is still wrapped as a FailedToSendException, and should not be retried.

      So the suggestion is to make clear in the api javadoc (or scaladoc) for Producer.send, the set of exception types that can be thrown (so that we don't have to search through source code to find them). And add exception types, or perhaps fields within FailedToSendException, so that it's possible to reason about whether retrying might make sense.

      Currently, in addition, I've found that Producer.send can throw a QueueFullException in async mode (this should be a retryable exception, after time has elapsed, etc.), and also a ClassCastException, if there's a misconfiguration between the configured Encoder and the message data type. I suspect there are other RuntimeExceptions that can also be thrown (e.g. NullPointerException if the message/topic are null).

        Activity

        Hide
        Jason Rosenberg added a comment -

        any hope for getting this issue some attention in time for 0.8.1?

        Show
        Jason Rosenberg added a comment - any hope for getting this issue some attention in time for 0.8.1?
        Hide
        Joe Stein added a comment -

        Hi Jason, I have a developer on the bench for a few days starting tomorrow that can take a look at this. Its a good chance for him to dive into the internals more so I should be able to get him started on it.

        Show
        Joe Stein added a comment - Hi Jason, I have a developer on the bench for a few days starting tomorrow that can take a look at this. Its a good chance for him to dive into the internals more so I should be able to get him started on it.
        Hide
        Dima Pekar added a comment -

        It's relatively easy to add FailedToSendMessageException#recoverable: Boolean field and to pass it correctly in the
        kafka.producer.async.DefaultEventHandler#handle(events). But unfortunately recoverability info is not accessible
        in context of this method.

        On the other hand, KAFKA-998 already has a patch, providing recoverability info to the context of
        kafka.producer.async.DefaultEventHandler#handle(events) method.

        Suppose, that this should be fixed after KAFKA-998 merge. Any chances to merge KAFKA-998 patch?

        Show
        Dima Pekar added a comment - It's relatively easy to add FailedToSendMessageException#recoverable: Boolean field and to pass it correctly in the kafka.producer.async.DefaultEventHandler#handle(events). But unfortunately recoverability info is not accessible in context of this method. On the other hand, KAFKA-998 already has a patch, providing recoverability info to the context of kafka.producer.async.DefaultEventHandler#handle(events) method. Suppose, that this should be fixed after KAFKA-998 merge. Any chances to merge KAFKA-998 patch?
        Hide
        Joe Stein added a comment -

        Hi Dima, I would say no for two reason. As Jun says in KAFKA-998 there are places where exceptions are swallowed that would have to get looked at and also possibly reworked. I would also argue that since 998 didn't make it into 0.8.0 that such a change (or spending time on it for 0.8.1) is (arguably) API breaking (since folks might have already built logic around exceptions and have expectations around it so doing it in the 0.8.X is not advisable). We would not want to put out 0.8.1 and folks catching an exception for messagetolarge and doing something with it no longer can't (which would be fine though in 0.9.X for API "breaking" changes)

        For this ticket I think just updating the comments for the function to say which exceptions can be thrown so folks can look at the code and know what to expect and so that the javadoc/scaladoc generation would have this information inside of it would be sufficient and helpful... for this ticket

        Show
        Joe Stein added a comment - Hi Dima, I would say no for two reason. As Jun says in KAFKA-998 there are places where exceptions are swallowed that would have to get looked at and also possibly reworked. I would also argue that since 998 didn't make it into 0.8.0 that such a change (or spending time on it for 0.8.1) is (arguably) API breaking (since folks might have already built logic around exceptions and have expectations around it so doing it in the 0.8.X is not advisable). We would not want to put out 0.8.1 and folks catching an exception for messagetolarge and doing something with it no longer can't (which would be fine though in 0.9.X for API "breaking" changes) For this ticket I think just updating the comments for the function to say which exceptions can be thrown so folks can look at the code and know what to expect and so that the javadoc/scaladoc generation would have this information inside of it would be sufficient and helpful... for this ticket
        Hide
        Dima Pekar added a comment -

        Provided patch, containing updated docs for send method of Producer.
        Docs now contains throws clauses for possible exceptions.

        Show
        Dima Pekar added a comment - Provided patch, containing updated docs for send method of Producer. Docs now contains throws clauses for possible exceptions.
        Hide
        Guozhang Wang added a comment -

        Thanks for the patch, non-committer +1.

        Show
        Guozhang Wang added a comment - Thanks for the patch, non-committer +1.
        Hide
        Jason Rosenberg added a comment -

        Joe Stein, I'm not sure I understand your reasoning......and please understand that my primary issue in the ticket, is to provide visibility when we see a MessageSizeTooLargeException. Adding better documentation is nice (and great actually!), but does not address my issue raised in this ticket.

        Regarding KAFKA-998, I think all that does is not repeatedly retry (e.g. 3 times) if the broker responds with a MessageSizeTooLargeException. So, I don't think that's an api breaking change to merge the patch there (not sure what Jun Rao's concerns are there?).

        But certainly, currently, you cannot have any logic in the api that looks for the MessageSizeTooLargeException, so adding that visibility would not break any existing client code. You'd still throw a FailedToSendException as before, I'm just asking for extra info in the exception.

        I literally have TODO's in my code to the likes of "Once KAFKA-1025 is solved, remove this bogus guess-work logic to handle the possibility of a MessageSizeTooLargeException....

        Show
        Jason Rosenberg added a comment - Joe Stein , I'm not sure I understand your reasoning......and please understand that my primary issue in the ticket, is to provide visibility when we see a MessageSizeTooLargeException. Adding better documentation is nice (and great actually!), but does not address my issue raised in this ticket. Regarding KAFKA-998 , I think all that does is not repeatedly retry (e.g. 3 times) if the broker responds with a MessageSizeTooLargeException. So, I don't think that's an api breaking change to merge the patch there (not sure what Jun Rao 's concerns are there?). But certainly, currently, you cannot have any logic in the api that looks for the MessageSizeTooLargeException, so adding that visibility would not break any existing client code. You'd still throw a FailedToSendException as before, I'm just asking for extra info in the exception. I literally have TODO's in my code to the likes of "Once KAFKA-1025 is solved, remove this bogus guess-work logic to handle the possibility of a MessageSizeTooLargeException....
        Hide
        Joe Stein added a comment -

        Jason Rosenberg So what you are saying/asking is in DefaultEventHandler instead of passing null for the cause when throwing FailedToSendMessageException you want to pass in the Exception that is actually making the FailedToSendMessageException happen? Then on the catch of the FailedToSendMessageException you could then do .getCause() to interrogate why?

        Show
        Joe Stein added a comment - Jason Rosenberg So what you are saying/asking is in DefaultEventHandler instead of passing null for the cause when throwing FailedToSendMessageException you want to pass in the Exception that is actually making the FailedToSendMessageException happen? Then on the catch of the FailedToSendMessageException you could then do .getCause() to interrogate why?
        Hide
        Jason Rosenberg added a comment -

        Yes! Or some sort of error code.

        Show
        Jason Rosenberg added a comment - Yes! Or some sort of error code.

          People

          • Assignee:
            Unassigned
            Reporter:
            Jason Rosenberg
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:

              Development