Uploaded image for project: 'Samza'
  1. Samza
  2. SAMZA-1282

Spinning up more containers than the number of tasks kills leader

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.13.0
    • Fix Version/s: 0.13.1
    • Component/s: container
    • Labels:
      None

      Description

      When a user tries to spin up more containers than the max partitions or tasks, the leader process gets killed.

      We throw an exception in the TaskNameGrouper for the above scenario and that needs to be handled gracefully by the leader and kill the newly spun containers as opposed bailing out.

      Here is the stack trace

       2017-05-10 15:13:24.526 [debounce-thread-0] ScheduleAfterDebounceTime [ERROR] OnProcessorChange threw an exception.
      java.lang.IllegalArgumentException: number of containers 2 is bigger than number of tasks 1
      	at org.apache.samza.container.grouper.task.GroupByContainerIds.group(GroupByContainerIds.java:68)
      	at org.apache.samza.coordinator.JobModelManager$.readJobModel(JobModelManager.scala:258)
      	at org.apache.samza.coordinator.JobModelManager.readJobModel(JobModelManager.scala)
      	at org.apache.samza.zk.ZkJobCoordinator.generateNewJobModel(ZkJobCoordinator.java:212)
      	at org.apache.samza.zk.ZkJobCoordinator.doOnProcessorChange(ZkJobCoordinator.java:125)
      	at org.apache.samza.zk.ZkJobCoordinator.lambda$onProcessorChange$1(ZkJobCoordinator.java:120)
      	at org.apache.samza.zk.ScheduleAfterDebounceTime.lambda$scheduleAfterDebounceTime$0(ScheduleAfterDebounceTime.java:89)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:748)
      
      

        Issue Links

          Activity

          Hide
          spvenkat Shanthoosh Venkataraman added a comment -

          Scenario : When there’re more stream processors(P) than tasks(T) [ X is number of stream processors, Y is number of tasks. X > Y].

          Current behavior : Fail with RuntimeException.

          Possible solutions:
          Solution A:

          Sort the stream processors using unique zookeeper sequential id associated with each processor. Generate job model using ‘Y’ lexicographically least stream processors and kill the rest of stream processors.

          Pros:

          • Straight forward and doesn’t require much change.

          Cons:

          • Additional stream processors are killed instead of using them when there're death to existing members of processors group.

          Solution B:

          Sort the stream processors using unique zookeeper sequential id associated with each processor. 
Generate job model using ‘Y’ lexicographically least stream processors and allow additional stream processors to live (could join group when any chosen stream processor dies). Will require each stream processor to hold local state (if it’s part of a group or not) and ignore zookeeper events if not part of the group.

          Pros:

          • Improved fault tolerance to stream processor deaths in a group.

          Cons:

          • Expected obvious performance drop since standby processors consume system resources and receive zookeeper events.
          Show
          spvenkat Shanthoosh Venkataraman added a comment - Scenario : When there’re more stream processors(P) than tasks(T) [ X is number of stream processors, Y is number of tasks. X > Y]. Current behavior : Fail with RuntimeException. Possible solutions: Solution A: Sort the stream processors using unique zookeeper sequential id associated with each processor. Generate job model using ‘Y’ lexicographically least stream processors and kill the rest of stream processors. Pros: Straight forward and doesn’t require much change. Cons: Additional stream processors are killed instead of using them when there're death to existing members of processors group. Solution B: Sort the stream processors using unique zookeeper sequential id associated with each processor. 
Generate job model using ‘Y’ lexicographically least stream processors and allow additional stream processors to live (could join group when any chosen stream processor dies). Will require each stream processor to hold local state (if it’s part of a group or not) and ignore zookeeper events if not part of the group. Pros: Improved fault tolerance to stream processor deaths in a group. Cons: Expected obvious performance drop since standby processors consume system resources and receive zookeeper events.
          Hide
          spvenkat Shanthoosh Venkataraman added a comment -

          Navina Ramesh Boris Shkolnik

          Above two approaches are possible to this problem.

          I prefer B.(I think it's long term, but requires big change compared to A).

          Please share your thoughts.

          Show
          spvenkat Shanthoosh Venkataraman added a comment - Navina Ramesh Boris Shkolnik Above two approaches are possible to this problem. I prefer B.(I think it's long term, but requires big change compared to A). Please share your thoughts.
          Hide
          boryas Boris Shkolnik added a comment -

          As long as the main thing (dying of the Leader) is fixed, the other two solutions are kind of equal, and depend on user preferences. I am ok with solution B (the extra task just dies), but we need to make the change for the Leader to avoid the regeneration of the JobModel in case list of participants don't change.

          Show
          boryas Boris Shkolnik added a comment - As long as the main thing (dying of the Leader) is fixed, the other two solutions are kind of equal, and depend on user preferences. I am ok with solution B (the extra task just dies), but we need to make the change for the Leader to avoid the regeneration of the JobModel in case list of participants don't change.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user shanthoosh opened a pull request:

          https://github.com/apache/samza/pull/244

          SAMZA-1282: Spinning up more containers than number of tasks.

          Changes

          • Stop streamProcessor in onNewJobModelAvailable eventHandler(instead of onNewJobModelConfirmed
            eventHandler) when it's not part of the group and prevent it from joining the barrier.
          • When numContainerIds > numTaskModels, generate JobModel by choosing lexicographically
            least `x` containerIds(where x = numTaskModels).
          • Added unit and integration tests in appropriate classes to verify the expected behavior.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/shanthoosh/samza more_processor_than_tasks

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/samza/pull/244.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #244


          commit 2e1d372e18fbb8077f5d02907ef4e68bbbdfc6a8
          Author: Shanthoosh Venkataraman <svenkataraman@linkedin.com>
          Date: 2017-07-12T23:19:28Z

          SAMZA-1282: Spinning up more containers than number of tasks.

          Changes

          • Stop streamProcessor in onNewJobModelAvailable eventHandler(instead of onNewJobModelConfirmed
            eventHandler) when it's not part of the group and prevent it from joining the barrier.
          • When numContainerIds > numTaskModels, generate JobModel by choosing lexicographically
            least `x` containerIds(where x = numTaskModels).
          • Added unit and integration tests in appropriate classes to verify the expected behavior.

          Show
          githubbot ASF GitHub Bot added a comment - GitHub user shanthoosh opened a pull request: https://github.com/apache/samza/pull/244 SAMZA-1282 : Spinning up more containers than number of tasks. Changes Stop streamProcessor in onNewJobModelAvailable eventHandler(instead of onNewJobModelConfirmed eventHandler) when it's not part of the group and prevent it from joining the barrier. When numContainerIds > numTaskModels, generate JobModel by choosing lexicographically least `x` containerIds(where x = numTaskModels). Added unit and integration tests in appropriate classes to verify the expected behavior. You can merge this pull request into a Git repository by running: $ git pull https://github.com/shanthoosh/samza more_processor_than_tasks Alternatively you can review and apply these changes as the patch at: https://github.com/apache/samza/pull/244.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #244 commit 2e1d372e18fbb8077f5d02907ef4e68bbbdfc6a8 Author: Shanthoosh Venkataraman <svenkataraman@linkedin.com> Date: 2017-07-12T23:19:28Z SAMZA-1282 : Spinning up more containers than number of tasks. Changes Stop streamProcessor in onNewJobModelAvailable eventHandler(instead of onNewJobModelConfirmed eventHandler) when it's not part of the group and prevent it from joining the barrier. When numContainerIds > numTaskModels, generate JobModel by choosing lexicographically least `x` containerIds(where x = numTaskModels). Added unit and integration tests in appropriate classes to verify the expected behavior.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/samza/pull/244

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/samza/pull/244
          Hide
          navina Navina Ramesh added a comment -

          Issue resolved by pull request 244
          https://github.com/apache/samza/pull/244

          Show
          navina Navina Ramesh added a comment - Issue resolved by pull request 244 https://github.com/apache/samza/pull/244

            People

            • Assignee:
              spvenkat Shanthoosh Venkataraman
              Reporter:
              bharathkk Bharath Kumarasubramanian
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development