[SAMZA-871] Implement heart-beat mechanism between JobCoordinator and all running containers - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.13.0
Component/s: None
Labels: None

Description

Right now, Samza relies on YARN to detect whether a container is alive or not. This has a few problems:
1) With the effort to support standalone Samza (SAMZA-516) and to make Samza pluggable with other distributed cluster management systems (such as Mesos and Kubernetes), container liveness detection needs to be independent of YARN.
2) YARN-based liveness detection has also caused containers to leak when the NM crashes. This creates a dilemma:

1. If the NM can be restarted quickly, we would like to keep the container alive and unaffected by the NM going down, since that saves ongoing work (yarn.nodemanager.recovery.enabled=true).
2. However, when the RM loses the heartbeat from the NM and determines that the container is "dead", we must make sure the container is actually killed to avoid duplicate containers being launched, since the AM has no other way to know whether the container is still alive.

If we implement a direct heartbeat mechanism between the Samza JobCoordinator and each SamzaContainer, liveness detection no longer depends on the YARN RM/NM/AM sync status.
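To make the idea concrete, here is a minimal sketch of what coordinator-side liveness tracking could look like. It is not Samza's actual API: the class name `HeartbeatTracker`, its methods, and the timeout semantics are all assumptions made for illustration. The caller passes in the current time so the behaviour is deterministic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical coordinator-side tracker (illustration only, not a Samza class).
// A container counts as alive if its last heartbeat is at most timeoutMs old.
final class HeartbeatTracker {
    private final long timeoutMs;
    private final Map<String, Long> lastSeenMs = new ConcurrentHashMap<>();

    HeartbeatTracker(long timeoutMs) {
        if (timeoutMs <= 0) {
            throw new IllegalArgumentException("timeoutMs must be positive");
        }
        this.timeoutMs = timeoutMs;
    }

    // Called whenever a heartbeat from containerId arrives.
    // Math::max keeps the newest timestamp if heartbeats arrive out of order.
    void recordHeartbeat(String containerId, long nowMs) {
        lastSeenMs.merge(containerId, nowMs, Math::max);
    }

    // Containers that never sent a heartbeat are reported as not alive.
    boolean isAlive(String containerId, long nowMs) {
        Long last = lastSeenMs.get(containerId);
        return last != null && nowMs - last <= timeoutMs;
    }

    // Sorted ids of containers whose last heartbeat is older than the timeout.
    List<String> expiredContainers(long nowMs) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeenMs.entrySet()) {
            if (nowMs - entry.getValue() > timeoutMs) {
                expired.add(entry.getKey());
            }
        }
        Collections.sort(expired);
        return expired;
    }
}
```

With a tracker like this, the coordinator's decision to treat a container as dead would be based only on heartbeats it received itself, regardless of what the RM or NM currently reports.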

Possible approaches could be:
1) Use the JobCoordinator HTTP port for the heartbeat. Pros: simple, synchronous communication. Cons: could become a bottleneck in a job with many containers; the timeout value is hard to tune.
2) Use the CoordinatorStream as the heartbeat channel. Pros: the async pub-sub model avoids timeouts in synchronous calls and scales easily to a large number of containers. Cons: the protocol is more complex to implement, and message/token delivery latency may be uncertain, which could make the heartbeat process much longer.
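Whichever channel is chosen, the container side could follow the same pattern: ask the coordinator whether this container is still considered valid, and shut down if it is not. The sketch below is only an illustration under stated assumptions. `ContainerSelfCheck`, the `Callable` standing in for the transport (HTTP or CoordinatorStream), and the rule of tolerating a fixed number of consecutive failed checks are all hypothetical, not part of Samza.

```java
import java.util.concurrent.Callable;

// Hypothetical container-side check (illustration only, not a Samza class).
// coordinatorCheck abstracts the transport: it returns true if the coordinator
// still recognises this container, false if it does not, and throws if the
// coordinator could not be reached in time.
final class ContainerSelfCheck {
    private final Callable<Boolean> coordinatorCheck;
    private final int maxConsecutiveFailures;
    private final Runnable shutdown;
    private int consecutiveFailures = 0;
    private boolean shutdownRequested = false;

    ContainerSelfCheck(Callable<Boolean> coordinatorCheck,
                       int maxConsecutiveFailures,
                       Runnable shutdown) {
        if (maxConsecutiveFailures < 1) {
            throw new IllegalArgumentException("maxConsecutiveFailures must be >= 1");
        }
        this.coordinatorCheck = coordinatorCheck;
        this.maxConsecutiveFailures = maxConsecutiveFailures;
        this.shutdown = shutdown;
    }

    // Runs one heartbeat round; returns true if the container should keep running.
    synchronized boolean tick() {
        if (shutdownRequested) {
            return false;
        }
        Boolean valid = null; // null means the check itself failed
        try {
            valid = coordinatorCheck.call();
        } catch (Exception e) {
            // Unreachable coordinator or timeout: counted as a failed round below.
        }
        if (Boolean.TRUE.equals(valid)) {
            consecutiveFailures = 0;
            return true;
        }
        // An explicit "not valid" answer stops the container immediately;
        // failed rounds are tolerated until the configured limit is reached.
        if (Boolean.FALSE.equals(valid) || ++consecutiveFailures >= maxConsecutiveFailures) {
            shutdownRequested = true;
            shutdown.run();
            return false;
        }
        return true;
    }
}
```

In a real container, `tick()` could be driven periodically, for example by a `ScheduledExecutorService`. Separating an explicit rejection from a failed check is one way to soften the timeout-tuning problem noted for approach 1: a slow coordinator costs a few missed rounds, while a container the coordinator no longer knows about stops right away instead of leaking.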

Issue Links

is depended upon by: SAMZA-921 Consolidate LocalityManager and TaskAssignmentManager (Open)

is related to: SAMZA-881 Re-think the Samza Job Coordinator (Open)

relates to: SAMZA-1116 Yarn RM recovery causing duplicate containers (Open)

links to: GitHub Pull Request #163

People

Assignee: Abhishek Shivanna

Reporter: Yi Pan

Votes: 4

Watchers: 9

Dates

Created: 10/Feb/16 08:11

Updated: 11/May/17 19:46

Resolved: 11/May/17 19:46