Details

    • New Feature
    • Status: Closed
    • Minor
    • Resolution: Duplicate
    • 0.8.0
    • 0.8.1
    • examples
    • None

    Description

      From pangolulu

      There is a document about streaming kmeans in Spark (https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html), I think we can try to implement it on Gearpump. Here is my processor topology on Gearpump:

      The `Source Processor` will produce points by time, then broadcast the point to the `Distribution Processor`. The number of tasks of the `Distribution Processor` is k, where each task save one center and the corresponding points. When `Distribution Processor` receives a point from `Source Processor`, it will calculate the distance of this point to its center, and then send the distance along with the point and its `taskId` to the `Collection Processor`. When `Collection Processor` receives the distance from `Distribution Processor`, it will accumulate the number of current points, determine if it's time to update center, choose the smallest distance and then send the point along with its corresponding `Distribution Processor` taskId by broadcast partitioner. When `Distribution Processor` receives the result message, task with the corresponding `taskId` will accumulate the point. If `Distribution Processor` receives that it's time to update center, then all the tasks will update its corresponding center.

      This procedure is streaming and the center of cluster will change by time.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              Kam Kasravi Kam Kasravi
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Slack

                  Issue deployment