[GEARPUMP-55] Add kmeans example - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Duplicate
Affects Version/s: 0.8.0
Fix Version/s: 0.8.1
Component/s: examples
Labels: None

Description

There is an article about streaming k-means in Spark (https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html). I think we can try to implement it on Gearpump. Here is my processor topology on Gearpump:

The `Source Processor` produces points over time and broadcasts each point to the `Distribution Processor`. The `Distribution Processor` has k tasks, each of which stores one center and the points assigned to it. When a `Distribution Processor` task receives a point from the `Source Processor`, it calculates the distance from the point to its center and sends the distance, the point, and its `taskId` to the `Collection Processor`. When the `Collection Processor` receives the distances from the `Distribution Processor`, it increments its count of points seen so far, determines whether it is time to update the centers, chooses the smallest distance, and then sends the point and the `taskId` of the corresponding `Distribution Processor` task back to the `Distribution Processor` through a broadcast partitioner. When the `Distribution Processor` receives the result message, the task with the matching `taskId` accumulates the point. If the message indicates that it is time to update the centers, every task updates its own center.
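The message flow described above can be sketched outside Gearpump as a small, single-process Python simulation. This is only an illustration under stated assumptions, not Gearpump code: the class and function names, the `update_every` parameter, and the rule "the new center is the mean of the accumulated points, after which the buffer is cleared" are all hypothetical choices, since the description does not specify when or how centers are updated.

```python
import math
from dataclasses import dataclass, field

Point = tuple[float, ...]


@dataclass
class DistributionTask:
    """One of the k tasks: owns a single center and the points assigned to it."""
    task_id: int
    center: Point
    points: list[Point] = field(default_factory=list)

    def on_point(self, point: Point) -> tuple[float, Point, int]:
        # Report the distance from the point to this task's center,
        # together with the point and this task's id.
        return math.dist(point, self.center), point, self.task_id

    def on_result(self, point: Point, winner_id: int, update: bool) -> None:
        # Every task sees the broadcast result; only the winner keeps the point.
        if winner_id == self.task_id:
            self.points.append(point)
        # Assumption: the updated center is the mean of the accumulated
        # points, and the buffer is cleared afterwards.
        if update and self.points:
            n = len(self.points)
            self.center = tuple(sum(axis) / n for axis in zip(*self.points))
            self.points.clear()


@dataclass
class CollectionProcessor:
    """Counts points, picks the nearest task, and decides when to update."""
    update_every: int  # hypothetical: trigger an update every N points
    seen: int = 0

    def on_distances(self, reports) -> tuple[Point, int, bool]:
        self.seen += 1
        # Choose the report with the smallest distance.
        _, point, winner_id = min(reports, key=lambda r: r[0])
        return point, winner_id, self.seen % self.update_every == 0


def run(stream, initial_centers, update_every):
    """Drive the three roles in a plain loop and return the final centers."""
    tasks = [DistributionTask(i, c) for i, c in enumerate(initial_centers)]
    collector = CollectionProcessor(update_every)
    for point in stream:  # the loop plays the role of the source
        reports = [t.on_point(point) for t in tasks]  # broadcast to all tasks
        result = collector.on_distances(reports)
        for t in tasks:  # broadcast the result back to all tasks
            t.on_result(*result)
    return [t.center for t in tasks]
```

For example, `run([(0.0, 0.0), (10.0, 10.0), (0.0, 2.0), (10.0, 12.0)], [(1.0, 1.0), (9.0, 9.0)], 4)` assigns two points to each task and, on the fourth point, moves the centers to `(0.0, 1.0)` and `(10.0, 11.0)`. In a real topology the two list comprehensions would be replaced by messages routed through partitioners between separate tasks.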

The procedure is streaming, so the cluster centers change over time.

Attachments

Issue Links

duplicates

GEARPUMP-110 Try streaming kmeans on Gearpump

Open

People

Assignee: Unassigned

Reporter: Kam Kasravi

Votes: 0

Watchers: 2

Dates

Created: 20/Apr/16 12:23

Updated: 27/Apr/16 06:52

Resolved: 27/Apr/16 06:52
