Uploaded image for project: 'Mahout'
  1. Mahout
  2. MAHOUT-1154

Implementing Streaming KMeans



    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.8
    • Component/s: Clustering
    • Labels:


      An implementation of Streaming KMeans as mentioned in [1] is available here [2].

      [2] https://github.com/dfilimon/mahout

      Since there will be more than one patches, there will be specific JIRA issues that address each one.

      The description of the code being added is:

      The main classes are in o.a.m.clustering.streaming [1], under the
      core/ project. These are subdivided into 2 packages:

      • cluster: contains the BallKMeans and StreamingKMeans classes that
        can be used standalone.
        BallKMeans is exactly what it sounds like (uses k-means++ for the
        initialization, then does a normal k-means pass and ignoring
        StreamingKMeans implements the online clustering that doesn't return
        exactly k clusters, (it returns an estimate). This is used to
        approximate the data.
      • mapreduce: contains the CentroidWritable, StreamingKMeansDriver,
        StreamingKMeansMapper and StreamingKMeansReducer classes.
        CentroidWritable serializes Centroids (sort of like AbstractCluster).
        StreamingKMeansDriver provides the driver for the job.
        StreamingKMeansMapper runs StreamingKMeans in the mappers to produce
        sketches of the data for the reducer.
        StreamingKMeansReducer collects the centroids produced by the
        mappers into one set of weighted points and runs BallKMeans on them
        producing the final results.

      Additionally the searchers are in o.a.m.math.neighborhood

      • neighborhood: various searcher classes that implement nearest-neighbor
        search using different strategies.
        Searcher, UpdatableSearcher: abstract classes that define how to
        search through collections of vectors.
        BruteSearch: does a brute search (looks at every point...)
        ProjectionSearch: uses random projections for searching.
        FastProjectionSearch: also uses random projections (but not binary
        search trees as in ProjectionSearch).
        HashedVector, LocalitySensitiveHashSearch: implement locality
        sensitive hash search.

      All the tools that I used are in o.a.m.clustering.streaming [2], under
      the examples/ project.
      There are a bunch of classes here, covering everything from
      vectorizing 20 newsgroups data to various IO utils. The more important
      ones are:
      utils.ExperimentUtils: convenience methods.
      tools.ClusterQuality20NewsGroups: actual experiment, with hardcoded paths.

      [3] https://github.com/dfilimon/mahout/tree/skm/core/src/main/java/org/apache/mahout/clustering/streaming
      [4] https://github.com/dfilimon/mahout/tree/skm/examples/src/main/java/org/apache/mahout/clustering/streaming

      The relevant issues are:




            • Assignee:
              dfilimon Dan Filimon
              dfilimon Dan Filimon
            • Votes:
              1 Vote for this issue
              5 Start watching this issue


              • Created: