  Samza / SAMZA-871

Implement heart-beat mechanism between JobCoordinator and all running containers

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels:
      None

      Description

      Right now, Samza relies on YARN to detect whether a container is alive or not. This has a few problems:
      1) With the effort to build standalone Samza (SAMZA-516) and to make Samza pluggable with other distributed cluster management systems (such as Mesos and Kubernetes), container liveness detection needs to be independent of YARN.
      2) YARN-based liveness detection has also caused problems with leaked containers when the NM crashes. This creates a dilemma:

        1. If the NM can be restarted quickly, we would like to keep the container alive and unaffected by the NM going down, since that preserves ongoing work (yarn.nodemanager.recovery.enabled=true).
        2. However, when the RM loses the heart beat from the NM and determines that the container is "dead", we must make sure the container is actually killed to avoid launching duplicate containers, since the AM has no other way to know whether the container is still alive.

      If we implement a direct heart beat mechanism between Samza JobCoordinator and SamzaContainer, we can be agnostic to whatever the YARN RM/NM/AM sync status is.
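      One way a direct heart beat could address the duplicate-container problem is for each container to stop itself once it can no longer reach the JobCoordinator. The sketch below is only an illustration under that assumption; the ticket does not prescribe this behavior. `CoordinatorLivenessCheck`, `pingCoordinator`, `maxMisses`, and `onCoordinatorLost` are hypothetical names, not Samza APIs.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical container-side check; names are illustrative, not Samza APIs. */
class CoordinatorLivenessCheck {
    private final BooleanSupplier pingCoordinator; // true if the coordinator acknowledged the heart beat
    private final int maxMisses;                   // consecutive failures tolerated before giving up
    private final Runnable onCoordinatorLost;      // assumed action, e.g. shut the container down
    private int consecutiveMisses = 0;
    private boolean fired = false;

    CoordinatorLivenessCheck(BooleanSupplier pingCoordinator, int maxMisses, Runnable onCoordinatorLost) {
        this.pingCoordinator = pingCoordinator;
        this.maxMisses = maxMisses;
        this.onCoordinatorLost = onCoordinatorLost;
    }

    /** Runs one heart beat attempt; intended to be called at a fixed interval. */
    void tick() {
        if (fired) {
            return; // the lost-coordinator action runs at most once
        }
        boolean ok;
        try {
            ok = pingCoordinator.getAsBoolean();
        } catch (RuntimeException e) {
            ok = false; // treat any failure to reach the coordinator as a missed heart beat
        }
        if (ok) {
            consecutiveMisses = 0; // a successful heart beat resets the counter
            return;
        }
        consecutiveMisses++;
        if (consecutiveMisses >= maxMisses) {
            fired = true;
            onCoordinatorLost.run();
        }
    }
}
```

      In a real container, `tick()` could be driven by a `ScheduledExecutorService`; the interval and the number of tolerated misses would need tuning per deployment.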

      Possible approaches could be:
      1) Use the JobCoordinator HTTP port for the heart beat. Pros: simple, synchronous communication. Cons: could become a bottleneck in a job with many containers; the timeout value is hard to tune.
      2) Use the CoordinatorStream as the heart beat channel. Pros: the async pub-sub model avoids timeouts in synchronous calls and scales easily to a large number of containers. Cons: the protocol is more complex to implement; message/token delivery latency may be unpredictable, which could make the heart beat process much longer.
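      Whichever channel carries the heart beats, the JobCoordinator side needs some table of "last seen" timestamps and a timeout. A minimal sketch of such a table follows, assuming a single timeout for all containers; `HeartbeatTracker` and its methods are hypothetical and not taken from the Samza code base. The clock is injected so the expiry logic can be exercised without sleeping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical coordinator-side liveness table; not part of the Samza API. */
class HeartbeatTracker {
    private final Map<String, Long> lastSeenMs = new ConcurrentHashMap<>();
    private final long timeoutMs;
    private final LongSupplier clock; // injected so callers (and tests) control time

    HeartbeatTracker(long timeoutMs, LongSupplier clock) {
        this.timeoutMs = timeoutMs;
        this.clock = clock;
    }

    /** Called whenever a heart beat arrives, e.g. from an HTTP handler or a stream consumer. */
    void recordHeartbeat(String containerId) {
        lastSeenMs.put(containerId, clock.getAsLong());
    }

    /** A container counts as alive if it has been seen within the timeout window. */
    boolean isAlive(String containerId) {
        Long seen = lastSeenMs.get(containerId);
        return seen != null && clock.getAsLong() - seen <= timeoutMs;
    }

    /** Returns the ids of containers whose last heart beat is older than the timeout. */
    List<String> expiredContainers() {
        long now = clock.getAsLong();
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeenMs.entrySet()) {
            if (now - entry.getValue() > timeoutMs) {
                expired.add(entry.getKey());
            }
        }
        return expired;
    }
}
```

      In production the clock would typically be `System::currentTimeMillis`. The timeout trade-off noted above still applies: too short a value flags slow containers as dead, too long a value delays failure detection.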

          Activity

          capricornius Chen Song added a comment -

          We have a similar need at our company. Let me give this a try.

          abkshvn Abhishek Shivanna added a comment -

          Chen Song, I already have a working prototype for this issue. I will submit the SEP and the patch in the next couple of days. I would love to get your input if you are also interested in working on this.

          jmakes Jake Maes added a comment -

          Hey Chen Song,

          Abhishek at LinkedIn just started looking into this last week. If you haven't made it very far with this, you might want to share notes and Abhishek can do it. CC. Abhishek Shivanna

          capricornius Chen Song added a comment -

          I have not started working on that yet. I will un-assign it.

          abkshvn Abhishek Shivanna added a comment -

          The SEP discussing the changes is here: https://cwiki.apache.org/confluence/display/SAMZA/SEP-3%3A+Heart-beat+mechanism+between+JobCoordinator+and+all+running+containers

          githubbot ASF GitHub Bot added a comment -

          GitHub user abhishekshivanna opened a pull request:

          https://github.com/apache/samza/pull/163

          SAMZA-871: Heart-beat mechanism between JobCoordinator and all running containers

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/abhishekshivanna/samza master

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/samza/pull/163.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #163


          commit ad6f5df48c96a30fd8823dba964f1d30ee5f3eb1
          Author: Abhishek Shivanna <ashivanna@linkedin.com>
          Date: 2017-04-26T00:59:43Z

          SAMZA-871: Implement heart-beat mechanism between JobCoordinator and all running containers


          githubbot ASF GitHub Bot added a comment -

          GitHub user asfgit closed the pull request at:

          https://github.com/apache/samza/pull/163


            People

            • Assignee:
              abkshvn Abhishek Shivanna
              Reporter:
              nickpan47 Yi Pan (Data Infrastructure)
            • Votes:
              4
              Watchers:
              10
