[KYLIN-4167] Refactor streaming coordinator - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: v3.0.0
Component/s: Real-time Streaming
Labels: None

Description

Summary

Currently, the coordinator has too many responsibilities, which violates the single responsibility principle and makes it hard to extend. A clear separation of responsibilities is recommended.
Some cluster-level operations have no atomicity guarantee; we should implement them in an idempotent way to achieve eventual consistency.
Resubmit the build job when it has been discarded.
Clarify the overall design for real-time OLAP.

StreamingCoordinator

Facade of the coordinator. It controls BuildJobSubmitter and ReceiverClusterManager and delegates operations to them.
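The facade role described above could look roughly like the following sketch. The interfaces `JobSubmitter` and `ReceiverManager`, the class `CoordinatorFacade`, and all method names are hypothetical stand-ins chosen for illustration; they are not the actual Kylin signatures.

```java
// Hypothetical collaborator interfaces; the real Kylin classes expose richer APIs.
interface JobSubmitter {
    void submitBuildJob(String cubeName, String segmentName);
}

interface ReceiverManager {
    void assignCube(String cubeName);
}

// Facade: holds no business logic of its own and only routes each call
// to the collaborator that is responsible for it.
final class CoordinatorFacade {
    private final JobSubmitter jobSubmitter;
    private final ReceiverManager receiverManager;

    CoordinatorFacade(JobSubmitter jobSubmitter, ReceiverManager receiverManager) {
        this.jobSubmitter = jobSubmitter;
        this.receiverManager = receiverManager;
    }

    // Build-related requests are delegated to the job submitter.
    void onSegmentReady(String cubeName, String segmentName) {
        jobSubmitter.submitBuildJob(cubeName, segmentName);
    }

    // Receiver-related requests are delegated to the cluster manager.
    void assignCube(String cubeName) {
        receiverManager.assignCube(cubeName);
    }
}
```

Keeping the facade this thin means each collaborator can be extended or tested on its own.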

BuildJobSubmitter

The main responsibilities of BuildJobSubmitter include:

1. Find candidate segments that are ready to have a build job submitted

2. Trace the status of each candidate segment's build job and promote the segment once it has met the requirements
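A minimal sketch of the tracing responsibility, together with the "resubmit when discarded" goal from the summary, might look like this. `BuildTracker`, `JobState`, and every method name are hypothetical; in particular, the `FINISHED` state is an assumption, since this issue only lists NEW/RUNNING/ERROR/DISCARD for a job that is still building.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// FINISHED is an assumed terminal state, not taken from the issue text.
enum JobState { NEW, RUNNING, ERROR, DISCARD, FINISHED }

final class BuildTracker {
    // Segment name -> state of its build job, in submission order.
    private final Map<String, JobState> jobs = new LinkedHashMap<>();

    void submit(String segment) {
        jobs.put(segment, JobState.NEW);
    }

    // Records a state change reported for an already submitted job.
    void report(String segment, JobState state) {
        jobs.replace(segment, state);
    }

    JobState stateOf(String segment) {
        return jobs.get(segment);
    }

    // One tracing pass: finished segments are returned for promotion and
    // no longer tracked; discarded jobs are resubmitted as NEW.
    List<String> traceOnce() {
        List<String> promoted = new ArrayList<>();
        for (Map.Entry<String, JobState> entry : jobs.entrySet()) {
            if (entry.getValue() == JobState.FINISHED) {
                promoted.add(entry.getKey());
            } else if (entry.getValue() == JobState.DISCARD) {
                entry.setValue(JobState.NEW);
            }
        }
        jobs.keySet().removeAll(promoted);
        return promoted;
    }
}
```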

ReceiverClusterManager

This class manages operations that involve multiple streaming receivers. These operations are often not atomic, though they may be idempotent.

ClusterStateChecker

Basic steps of this class:

1. Stop/pause the coordinator to avoid underlying concurrency issues

2. Check for inconsistent state across the whole receiver cluster

3. Send a summary by mail to the Kylin admin

4. If needed, call ClusterDoctor to repair the inconsistencies
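The four steps above can be sketched as a single method. All type and method names here (`Pausable`, `Inspector`, `Notifier`, `Doctor`, `StateCheckRun`) are hypothetical, and resuming the coordinator in a `finally` block is an assumption; the issue only says the coordinator is stopped or paused.

```java
import java.util.List;

// Hypothetical collaborators standing in for the coordinator, the state
// inspection logic, the mail service, and ClusterDoctor.
interface Pausable { void pause(); void resume(); }
interface Inspector { List<String> findInconsistencies(); }
interface Notifier { void send(String summary); }
interface Doctor { void repair(List<String> issues); }

final class StateCheckRun {
    static List<String> run(Pausable coordinator, Inspector inspector,
                            Notifier notifier, Doctor doctor, boolean autoRepair) {
        coordinator.pause();                                        // step 1
        try {
            List<String> issues = inspector.findInconsistencies();  // step 2
            String summary = issues.isEmpty()
                    ? "cluster state is consistent"
                    : issues.size() + " inconsistent item(s): " + issues;
            notifier.send(summary);                                 // step 3
            if (autoRepair && !issues.isEmpty()) {
                doctor.repair(issues);                              // step 4
            }
            return issues;
        } finally {
            // Assumption: the coordinator is resumed even if a step throws.
            coordinator.resume();
        }
    }
}
```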

ClusterDoctor

Repairs inconsistent state according to the result of ClusterStateChecker.

Candidate Segment

Candidate segments are the segments that can be seen/perceived by the streaming coordinator. A candidate segment is in one of the following states/queues:

1. Segments whose data has been uploaded PARTLY

2. Segments whose data has been uploaded completely and that are WAITING to build

3. Segments in BUILDING state; the job's state should be one of NEW/RUNNING/ERROR/DISCARD

4. Segments that were built successfully and are waiting to be delivered to the historical part (and deleted from the real-time part)

5. Segments in the historical part (HBase Ready Segment)

By design, a segment should move to the next queue sequentially (it must not jump the queue). Do not break this rule.
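One way to enforce the "no queue jumping" rule is to model the five queues as an ordered enum and reject any transition that is not to the immediate successor. This is only a sketch; the names `CandidateState` and `CandidateSegment` and the constant names are hypothetical labels for the five states listed above.

```java
// Declaration order mirrors the five states/queues in the description.
enum CandidateState {
    PARTLY_UPLOADED,
    WAITING_TO_BUILD,
    BUILDING,
    BUILT_PENDING_DELIVERY,
    HISTORICAL_READY;

    // A transition is legal only to the state declared immediately after this one.
    boolean canMoveTo(CandidateState target) {
        return target.ordinal() == ordinal() + 1;
    }
}

final class CandidateSegment {
    private final String name;
    private CandidateState state = CandidateState.PARTLY_UPLOADED;

    CandidateSegment(String name) {
        this.name = name;
    }

    CandidateState state() {
        return state;
    }

    // Moves the segment forward by exactly one queue, or fails loudly.
    void moveTo(CandidateState target) {
        if (!state.canMoveTo(target)) {
            throw new IllegalStateException(
                    "segment " + name + " cannot jump from " + state + " to " + target);
        }
        state = target;
    }
}
```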

Atomicity

In a multi-step transaction, the following aspects should be considered carefully:

1. Whether to fail fast or continue when an exception is thrown

2. Whether an API (remote call) should be synchronous or asynchronous

3. When a transaction fails, whether rollback can always succeed

4. A transaction should be idempotent, so that when it fails it can be fixed by retrying

How can the whole cluster operate smoothly without blocking problems? I divided all multi-step transactions into three kinds:

NotAtomicIdempotent

NotAtomicAndNotIdempotent

NonSideEffect
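As an illustration of how such a classification could drive failure handling, the sketch below tags each operation with its kind and retries only when a retry is safe. The enum `OperationKind`, the helper `ClusterOperations.execute`, and the retry policy are assumptions made for this example; the issue names the three kinds but does not prescribe this mechanism.

```java
import java.util.function.Supplier;

// The three kinds from the description, with an assumed retry policy:
// idempotent and side-effect-free operations may be retried, others may not.
enum OperationKind {
    NOT_ATOMIC_IDEMPOTENT(true),
    NOT_ATOMIC_AND_NOT_IDEMPOTENT(false),
    NON_SIDE_EFFECT(true);

    final boolean safeToRetry;

    OperationKind(boolean safeToRetry) {
        this.safeToRetry = safeToRetry;
    }
}

final class ClusterOperations {
    // Runs the operation, retrying up to maxAttempts times only when its kind
    // allows it; operations that are not safe to retry fail on the first error.
    static <T> T execute(OperationKind kind, int maxAttempts, Supplier<T> operation) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        int attempts = kind.safeToRetry ? maxAttempts : 1;
        RuntimeException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again if attempts remain
            }
        }
        throw last;
    }
}
```

Under this policy a failed non-idempotent operation surfaces immediately, so that it can be inspected and repaired rather than blindly repeated.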

Issue Links

links to

GitHub Pull Request #851

GitHub Pull Request #961

People

Assignee: Xiaoxiang Yu

Reporter: Xiaoxiang Yu

Votes: 0

Watchers: 4

Dates

Created: 15/Sep/19 15:52

Updated: 21/Jan/20 07:48

Resolved: 26/Dec/19 06:25