Kafka / KAFKA-3054

Connect Herder fails forever if sent a wrong connector config or task config

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Done
    • Affects Version/s: 0.9.0.0
    • Fix Version/s: 0.10.1.0
    • Component/s: KafkaConnect
    • Labels:
      None

      Description

      The Connect herder throws a ConnectException and shuts down if it is sent a wrong config, and restarting the herder keeps failing with the same wrong config. The herder should stay available when starting a connector or task fails. After receiving a delete connector request, the herder can then delete the wrong config from "config storage".

          Activity

          jin xing added a comment -

          Ewen Cheslack-Postava
          Is it a bug?

          Ewen Cheslack-Postava added a comment -

          jin xing Yes, this sounds like a bug. In general I think there are a number of places where we need to do a better job handling errors – e.g. this is during startup, but also during execution of tasks when an error can keep happening repeatedly such that a task can't even make any progress (whether the issue is in Connect or the other system). In order to better handle this generally we're going to need to keep track of status info, expose it via the REST API, and allow users to take corrective action (e.g. reconfiguring, restarting tasks, etc).

          However, that's a pretty big project. For this bug, it sounds like we're just missing a catch block during connector/task startup. We should instead be catching the exception and handling it by, at a minimum, logging some info at ERROR level.
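A minimal sketch of the "missing catch block" idea, assuming a hypothetical `Startable` interface and `GuardedStartup` helper (neither is part of Kafka Connect). It uses `java.util.logging` only to stay self-contained; `Level.SEVERE` stands in for the ERROR level of the logging framework the project actually uses.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for a connector or task; not the Kafka Connect API.
interface Startable {
    void start() throws Exception;
}

final class GuardedStartup {
    private static final Logger LOG = Logger.getLogger(GuardedStartup.class.getName());

    private GuardedStartup() {}

    // Returns true if startup succeeded, false if it threw and the failure was logged.
    static boolean startSafely(String name, Startable target) {
        try {
            target.start();
            return true;
        } catch (Exception e) {
            // java.util.logging has no ERROR level; SEVERE is its closest equivalent.
            LOG.log(Level.SEVERE, "Failed to start " + name, e);
            return false;
        }
    }
}
```

The caller keeps running after a failed start, for example `GuardedStartup.startSafely("my-connector", () -> { throw new IllegalStateException("bad config"); })` returns `false` instead of propagating the exception.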

          jin xing added a comment -

          Ewen Cheslack-Postava
          Thanks for the comment : )
          Currently the DistributedHerder only catches ConfigException, so any other exception thrown during connector or task startup will kill the DistributedHerder and the Worker.
          If the cluster has only one DistributedHerder, restarting will fail forever.
          It makes sense to let the herder swallow all exceptions thrown by a connector or task while handling their life cycle, so that the Herder and Worker can keep running.
          What do you think?

          Ewen Cheslack-Postava added a comment -

          jin xing We do want to catch them, but probably don't want to just swallow them, although that might be a short-term solution for this specific problem. We don't do a good job of tracking connector/task status in Connect right now. We'll need to track this information (and also expose it via the REST API, and allow control via APIs as suggested in KAFKA-2370). I know Jason Gustafson is also working on KAFKA-2886 now, which faces the same problem – we can only sort of half fix the issue before we have support for tracking status info.

          I'd say a good short-term solution would be to catch other exceptions and, at a minimum, log them at ERROR level. I haven't thought through the types of exceptions that might be generated, but it's possible we'll want to treat different exceptions somewhat differently (e.g. if they throw a ConnectException, the connector may have hit an issue, but is behaving well; if they throw anything that we can only classify as Throwable, we probably want to treat that as a bug in the connector itself and complain more loudly about it in the log). Then you might want to file a follow-up JIRA to make sure we don't lose track of that status change when we have support for tracking it.
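The distinction drawn above could look roughly like the following sketch. The local `ConnectException` is a stand-in so the example compiles without Kafka on the classpath (the real class is `org.apache.kafka.connect.errors.ConnectException`); `FailureKind` and `StartupClassifier` are hypothetical names, and the log wording is only illustrative.

```java
// Local stand-in so the sketch compiles without Kafka on the classpath;
// the real class is org.apache.kafka.connect.errors.ConnectException.
class ConnectException extends RuntimeException {
    ConnectException(String message) {
        super(message);
    }
}

// Hypothetical classification of a startup attempt.
enum FailureKind { NONE, CONNECTOR_REPORTED, CONNECTOR_BUG }

final class StartupClassifier {
    private StartupClassifier() {}

    static FailureKind classify(Runnable startup) {
        try {
            startup.run();
            return FailureKind.NONE;
        } catch (ConnectException e) {
            // The connector hit a problem but reported it the expected way.
            System.err.println("ERROR connector reported a failure: " + e.getMessage());
            return FailureKind.CONNECTOR_REPORTED;
        } catch (Throwable t) {
            // Anything else is treated as a likely bug in the connector itself.
            System.err.println("ERROR unexpected " + t.getClass().getName()
                    + ", likely a connector bug: " + t);
            return FailureKind.CONNECTOR_BUG;
        }
    }
}
```

Returning a value rather than only logging leaves room for a later status-tracking mechanism to record the outcome, which is the follow-up the comment asks for.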

          ASF GitHub Bot added a comment -

          Github user ZoneMayor closed the pull request at:

          https://github.com/apache/kafka/pull/782

          ASF GitHub Bot added a comment -

          GitHub user ZoneMayor reopened a pull request:

          https://github.com/apache/kafka/pull/782

          KAFKA-3054: Connect Herder fail if sent a wrong connector config or task config, catch exceptions to guard

          Exceptions may propagate to the DistributedHerder when starting or stopping connectors or tasks.
          The best solution is to track the status of connectors and tasks.
          This patch catches exceptions to guard the DistributedHerder; this is a short-term solution.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/ZoneMayor/kafka trunk-KAFKA-3054

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/kafka/pull/782.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #782


          commit 34240b52e1b70aa172b65155f6042243d838b420
          Author: ZoneMayor <jinxing6042@126.com>
          Date: 2015-12-18T07:22:20Z

          Merge pull request #12 from apache/trunk

          2015-12-18

          commit 52d02f333e86d06cfa8fff5facd18999b3db6d83
          Author: ZoneMayor <jinxing6042@126.com>
          Date: 2015-12-30T03:08:08Z

          Merge pull request #13 from apache/trunk

          2015-12-30

          commit d56be0b9e0849660c07d656c6019f9cc2f17ae55
          Author: ZoneMayor <jinxing6042@126.com>
          Date: 2016-01-10T09:24:06Z

          Merge pull request #14 from apache/trunk

          2016-1-10

          commit e75c8bbb3329868f92becc86645d31fa21c6c7f4
          Author: jinxing <jinxing@fenbi.com>
          Date: 2016-01-17T03:11:45Z

          KAFKA-3054: Connect Herder fail if sent a wrong connector config or task config, catch exceptions to guard


          ASF GitHub Bot added a comment -

          Github user ZoneMayor closed the pull request at:

          https://github.com/apache/kafka/pull/782

          ASF GitHub Bot added a comment -

          GitHub user ZoneMayor reopened a pull request:

          https://github.com/apache/kafka/pull/782

          commit efd6023821df014b06e509c1a13fd4e9d08113cb
          Author: jinxing <jinxing@fenbi.com>
          Date: 2016-01-25T13:20:36Z

          catch the exception thrown during stopping connector or task and try to restart

          commit d05af92d44ed26001f0ec96f5376790d2009249e
          Author: jinxing <jinxing@fenbi.com>
          Date: 2016-01-25T13:22:45Z

          small fix

          commit b6521f7d90b8fabfe3bfd086c188977d444660b5
          Author: jinxing <jinxing@fenbi.com>
          Date: 2016-01-25T13:23:12Z

          small fix


          Ewen Cheslack-Postava added a comment -

          We improved error handling in 0.10.0.0. In theory we should be catching and handling these errors, marking the connector/task as dead. However, we should make sure we have a test covering this specific case to validate the handling before marking this resolved. And we should make sure we cover both types of invalid configs for both types of connectors – since we do some initial parsing of configs to set up the connector within the framework and then the connector does its own parsing, we should make sure all failure scenarios are handled.
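As a rough illustration of the two failure points described above, the sketch below runs a framework-level check first and the connector's own parsing second, and records a failed status in either case. `connector.class` is a real Connect config key; `Status`, `TwoStageStartup`, and `ConnectorParser` are hypothetical names, and the actual framework validation is far more involved than a single key lookup.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical status values; not the framework's own state enum.
enum Status { RUNNING, FAILED }

final class TwoStageStartup {
    // Hypothetical hook for the connector's own config parsing.
    interface ConnectorParser {
        void parse(Map<String, String> config);
    }

    private final Map<String, Status> statuses = new HashMap<>();

    Status start(String name, Map<String, String> config, ConnectorParser connector) {
        Status status;
        try {
            // Stage 1: framework-level parsing, reduced here to one required key.
            if (!config.containsKey("connector.class")) {
                throw new IllegalArgumentException("missing connector.class");
            }
            // Stage 2: the connector parses its own settings and may reject them.
            connector.parse(config);
            status = Status.RUNNING;
        } catch (RuntimeException e) {
            // Either stage failing marks this connector as failed; the caller keeps running.
            status = Status.FAILED;
        }
        statuses.put(name, status);
        return status;
    }

    Status statusOf(String name) {
        return statuses.get(name);
    }
}
```

A test for this issue would exercise both stages, once with a config the framework rejects and once with a config only the connector rejects, and assert that the status is failed in both cases while other connectors can still be started.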

          Shikhar Bhushan added a comment -

          Addressing this in KAFKA-4042, which should take care of remaining robustness issues in the DistributedHerder from bad connector or task configs.

          ASF GitHub Bot added a comment -

          Github user pono closed the pull request at:

          https://github.com/apache/kafka/pull/782


            People

            • Assignee: Shikhar Bhushan
            • Reporter: jin xing
            • Votes: 0
            • Watchers: 6
