[SAMZA-798] Performance and stability issue after combining checkpoint and coordinator stream - ASF JIRA

Attach files

Attach Screenshot

Voters

Watch issue

Watchers

Create sub-task

Link

Clone

Update Comment Author

Replace String in Comment

Update Comment Visibility

Delete Comments

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.10.0
Fix Version/s: 0.10.0
Component/s: None
Labels:
None

Description

When testing 0.10 release candidate w/ large number of topic partitions and containers, we have observed that there is a serious stability issue when combining checkpoint and coordinator streams together.

The reasons being the following:
1) The current JobCoordinator's use case of coordinator includes the following:

job configuration
task changelog partition map
container locality info
misc (like migration marker, etc.)
checkpoint
Out of all the above, checkpoint creates the biggest problem:
1) It is much more dynamic than others. Trying to keep up-to-state w/ the coordinator stream's tail becomes impossible due to checkpointing from all containers. Mixing checkpoint w/ other message together in one stream makes it impossible to differentiate the cases whether there is more important configuration/status information that has to be read immediately versus there are just checkpoint updates in the coordinator stream.
2) In case of checkpoint, it is not necessary for JobCoordinator to be in the middle of the path. Our previous checkpoint model actually works properly: the checkpoint is written by the containers and read by the containers, and it is very clear that read only happens when recover/restart the container while write happens during the container runtime. Making all containers rely on the JobCoordinator to read the latest checkpoint actually makes JobCoordinator the single point bottleneck.
3) With the current change, removing CheckpointManagerFactory also disable the possibility for users to plugin their own checkpoint system.

Bottom line is: the write rate to coordinator stream should just be configuration and other low volume information. High-volume traffic should not be sent to coordinator stream, which is often used by JobCoordinator to build an up-to-date job status.

In addition, it would be preferred to move the checkpoint back before this release to avoid the unnecessary migration of checkpoint to coordinator stream.

Attachments

SAMZA-798-2.patch
02/Nov/15 21:56
195 kB
Navina Ramesh
SAMZA-798-1.patch
31/Oct/15 01:26
195 kB
Navina Ramesh
SAMZA-798-0.patch
30/Oct/15 21:01
184 kB
Navina Ramesh

Issue Links

Add Link

is part of

SAMZA-348 Configure Samza jobs through a stream

Open

Delete this link

Activity

Comment

This comment will be Viewable by All Users Viewable by All Users

Cancel

People

Assignee:: Navina Ramesh

Reporter:: Yi Pan

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 23/Oct/15 02:36

Updated:: 02/Nov/15 22:08

Resolved:: 02/Nov/15 22:08

Agile

View on Board

Performance and stability issue after combining checkpoint and coordinator stream

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Agile

Slack

Issue deployment