Description
TL;DR: improve errors.tolerance from [none, all] to [none, deserialization, transformation, put, all]
Hi all, this is my first time requesting an improvement here, so sorry if my request is unclear, incomplete, or has already been considered and rejected.
I am currently experiencing some issues with Kafka Connect error handling and the DLQ setup. It may simply be that my setup or my understanding is wrong, but it leads me to believe that the options Kafka Connect currently provides are insufficient.
Let me start with the current behavior:
https://kafka.apache.org/documentation/#sourceconnectorconfigs_errors.tolerance
errors.tolerance
Behavior for tolerating errors during connector operation. 'none' is the default value and signals that any error will result in an immediate connector task failure; 'all' changes the behavior to skip over problematic records.
Type: string
Default: none
Valid Values: [none, all]
Importance: medium
My understanding is that the Kafka Connect framework currently lets you either tolerate all errors or tolerate none, and leaves any finer-grained handling to the individual connector plugin implementations.
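For reference, this is roughly what today's all-or-nothing setup looks like on a sink connector (the DLQ topic name below is just a placeholder; the property names are the existing ones):

```properties
# Sink connector excerpt: with errors.tolerance=all, every failed record,
# whatever the stage of the failure, is skipped and routed to the DLQ.
errors.tolerance=all
errors.deadletterqueue.topic.name=dlq-my-sink-connector
errors.deadletterqueue.context.headers.enable=true
errors.log.enable=true
errors.log.include.messages=true
```

The only other option is errors.tolerance=none, where the first failure of any kind stops the task.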
My experience is mainly with Kafka sink connectors.
What I've experienced recently is also reported here as a possible improvement on an individual connector:
https://github.com/confluentinc/kafka-connect-jdbc/issues/721
I think the Kafka Connect framework could provide an option to better control in which scenarios records go to the DLQ and in which scenarios the connector fails.
In my opinion, failures in deserialization (key, header, and value converters) or in the transformation chain are good candidates to go to the DLQ.
Errors in the sink/put stage, on the other hand, should generally not go to the DLQ and should instead make the connector fail, because these errors are not (or may not be) transient: the next record will typically hit the same failure.
To explain further: if I have a connectivity issue or a tablespace issue, it makes no sense to move on to the next records and send them all to the DLQ, because until the target is up and running again there is no way to successfully process any data.
That said, I can imagine JDBC scenarios, constraint violations for example, where the failure only affects some records and we would still want those in the DLQ instead of failing the whole pipeline; that is why I think a configuration value for the "put" stage should also exist. A hypothetical sketch of the proposal follows below.
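To make this concrete, here is a minimal sketch of what the proposal could look like, assuming stage names as values (these values do not exist in Kafka Connect today; only the DLQ property name is real):

```properties
# HYPOTHETICAL sketch: today only 'none' and 'all' are valid values.
# Tolerate converter (key/header/value) and SMT failures, sending those
# records to the DLQ, but fail the task on any put() error:
errors.tolerance=deserialization,transformation
errors.deadletterqueue.topic.name=dlq-my-sink-connector

# Alternatively, also tolerate put-stage errors such as JDBC constraint
# violations that only affect individual records:
# errors.tolerance=deserialization,transformation,put
```

Whether the value should be a single stage or a list of stages is an open detail; the point is that the framework already knows at which stage a failure happened (it reports it in the DLQ context headers), so it could use that same information to decide between skipping to the DLQ and failing fast.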
Let me know if this is clear, or if any part of my understanding is wrong.
Best regards,
Miguel