- Currently, the coordinator has too many responsibilities, which violates the single responsibility principle and makes it hard to extend. A clear separation of responsibilities is recommended.
- Some cluster-level operations have no atomicity guarantee; we should implement them in an idempotent way to achieve eventual consistency
- Resubmit the build job when it was discarded
- Clarify overall design for realtime OLAP
Facade of the coordinator. It controls BuildJobSubmitter and ReceiverClusterManager and delegates operations to them.
The main responsibilities of BuildJobSubmitter include:
1. Find candidate segments that are ready to have a build job submitted
2. Track the status of each candidate segment's build job and promote the segment once it meets the requirements
This class manages operations that involve multiple streaming receivers. These operations are often not atomic and may be idempotent.
Basic steps of this class:
1. Stop/pause the coordinator to avoid underlying concurrency issues
2. Check for inconsistent state across the whole receiver cluster
3. Send a summary by mail to the Kylin admin
4. If needed, call ClusterDoctor to repair the inconsistencies
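The four steps above can be sketched roughly as follows. This is a minimal illustration, not the actual Kylin code: the interfaces `Coordinator`, `StateChecker`, `Mailer` and `Doctor`, the class `ClusterCheckTask`, and the `autoRepair` flag are all hypothetical names. Resuming the coordinator after the check is also an assumption.

```java
import java.util.List;

// All types below are hypothetical stand-ins, not Kylin's actual API.
interface Coordinator { void pause(); void resume(); }
interface StateChecker { List<String> findInconsistencies(); }
interface Mailer { void send(String subject, String body); }
interface Doctor { void repair(List<String> issues); }

final class ClusterCheckTask {
    private final Coordinator coordinator;
    private final StateChecker checker;
    private final Mailer mailer;
    private final Doctor doctor;

    ClusterCheckTask(Coordinator coordinator, StateChecker checker, Mailer mailer, Doctor doctor) {
        this.coordinator = coordinator;
        this.checker = checker;
        this.mailer = mailer;
        this.doctor = doctor;
    }

    /** Runs the four steps in order and returns the issues that were found. */
    List<String> run(boolean autoRepair) {
        coordinator.pause(); // step 1: avoid concurrent changes while checking
        try {
            List<String> issues = checker.findInconsistencies(); // step 2
            mailer.send("Receiver cluster check",
                    issues.size() + " issue(s): " + issues);    // step 3
            if (autoRepair && !issues.isEmpty()) {
                doctor.repair(issues);                           // step 4
            }
            return issues;
        } finally {
            coordinator.resume(); // assumption: the coordinator is resumed afterwards
        }
    }
}
```

Keeping the check, the notification and the repair behind separate interfaces matches the separation of responsibilities that this refactoring aims for.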
Repairs inconsistent state according to the result of ClusterStateChecker.
The candidate segments are those segments that can be seen/perceived by the streaming coordinator.
A candidate segment is in one of the following states/queues:
1. segments whose data is PARTLY uploaded
2. segments whose data is completely uploaded and WAITING to be built
3. segments in the BUILDING state; the job's state should be one of (NEW/RUNNING/ERROR/DISCARD)
4. segments that were built successfully and are waiting to be delivered to the historical part (and deleted from the realtime part)
5. segments in the historical part (HBase Ready Segment)
By design, a segment should move to the next queue sequentially (it must not jump the queue); do not break this.
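One way to enforce the sequential rule is to model the five queues as an ordered enum and reject any transition that is not exactly one step forward. The sketch below is illustrative only: `SegmentStage` and `CandidateSegment` are hypothetical names, not classes from the code base.

```java
// Hypothetical names; the constants mirror the five queues listed above, in order.
enum SegmentStage {
    PARTLY_UPLOADED,   // 1. data uploaded partly
    WAITING_TO_BUILD,  // 2. fully uploaded, waiting for a build job
    BUILDING,          // 3. build job is NEW/RUNNING/ERROR/DISCARD
    BUILT,             // 4. built, waiting for delivery to the historical part
    HISTORICAL;        // 5. HBase ready segment

    /** Returns the only legal successor, or null for the final stage. */
    SegmentStage next() {
        int i = ordinal() + 1;
        return i < values().length ? values()[i] : null;
    }

    /** A transition is legal only if it moves exactly one queue forward. */
    boolean canMoveTo(SegmentStage target) {
        return target != null && target == next();
    }
}

final class CandidateSegment {
    private final String name;
    private SegmentStage stage = SegmentStage.PARTLY_UPLOADED;

    CandidateSegment(String name) {
        this.name = name;
    }

    SegmentStage stage() {
        return stage;
    }

    /** Moves the segment forward; rejects skipped or backward transitions. */
    void moveTo(SegmentStage target) {
        if (!stage.canMoveTo(target)) {
            throw new IllegalStateException(
                    name + ": illegal transition " + stage + " -> " + target);
        }
        stage = target;
    }
}
```

With this shape, a discarded build job that is resubmitted simply leaves the segment in `BUILDING`; only a successful build moves it on to `BUILT`.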
In a multi-step transaction, the following aspects should be thought through carefully:
1. Should it fail fast or continue when an exception is thrown?
2. Should the API (remote call) be synchronous or asynchronous?
3. When the transaction fails, can the rollback always succeed?
4. The transaction should be idempotent, so that when it fails it can be fixed by retrying.
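To illustrate point 4, here is a small sketch of a multi-receiver operation whose steps are idempotent, so a partially failed run can be repaired by running it again. Everything in it is hypothetical: `FakeReceiver` is an in-memory stand-in for a remote receiver, and `AssignCubeOperation` is not a class from the code base. It also chooses to continue rather than fail fast, which is only one of the options raised in point 1.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory stand-in for a receiver reached via remote calls. */
final class FakeReceiver {
    final Set<String> assignedCubes = new TreeSet<>();
    int failuresLeft; // number of calls that fail before one succeeds

    FakeReceiver(int failuresLeft) {
        this.failuresLeft = failuresLeft;
    }

    /** Idempotent: assigning the same cube twice leaves the same state. */
    void assign(String cube) {
        if (failuresLeft > 0) {
            failuresLeft--;
            throw new IllegalStateException("simulated remote failure");
        }
        assignedCubes.add(cube);
    }
}

final class AssignCubeOperation {
    /** Applies the assignment to every receiver; returns true if all calls succeeded. */
    static boolean runOnce(Map<String, FakeReceiver> receivers, String cube) {
        boolean allOk = true;
        for (FakeReceiver receiver : receivers.values()) {
            try {
                receiver.assign(cube);
            } catch (IllegalStateException e) {
                allOk = false; // continue with the other receivers; a retry repairs this one
            }
        }
        return allOk;
    }

    /** Repeats the whole operation; safe only because each step is idempotent. */
    static boolean runWithRetry(Map<String, FakeReceiver> receivers, String cube, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (runOnce(receivers, cube)) {
                return true;
            }
        }
        return false;
    }
}
```

Because `assign` only converges the receiver towards the target state, repeating it on receivers that already succeeded is harmless, and no rollback path is needed.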
How can we ensure the whole cluster operates smoothly without blocking problems? I divided all multi-step transactions into three kinds: