[SAMZA-1992] Hot standby containers for improving recovery-time of Stateful jobs - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
None

Description

Recovery time for jobs with large state increases significantly in host-failure scenarios. This is problematic because of two reasons:
a) causes unavailability ,

and
b) causes backlog buildup in case of jobs with high input QPS, requiring scaleup (then scale-down) or causes increased catch-up time.

The solution comprises of two parts

1. Increasing restore parallelism: Samza restores stores at SamzaContainer startup sequentially for each task (using TaskStorageManager and TaskSideInputStorage Manager). Parallelizing task restores. We can parallelize store restores either in SamzaContainer (using a bounded/configurable threadpool) or in the TaskInstance or in the TaskStorageManager.

2. Hot-Standby Containers:
A Samza container which consumes input, reads or updates state, or invokes external services or produces outputs is called an “active container.”
A hot-standby container is one which carries updated or hot KV state, and guarantees that, when it is used for a failover, its KV state corresponds to atleast the last checkpoint of the corresponding active container (for each task). There are multiple ways of building such hot-standby container implementations, see this section for pros and cons. Restore time With hot-standby container, restore times should be similar to host-affinity restore-times – 10s of seconds regardless of state size.

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Rayman

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 13/Nov/18 19:54

Updated:: 19/Mar/19 17:15