[MAHOUT-1154] Implementing Streaming KMeans - ASF JIRA

XML

Word

Printable

JSON

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.8
Fix Version/s: 0.8
Component/s: classic
Labels:
None

Description

An implementation of Streaming KMeans as mentioned in [1] is available here [2].

[1]http://mail-archives.apache.org/mod_mbox/mahout-dev/201303.mbox/%3CCAOwb3gOyf9zufrgXHsucpkJXk6cW0Nnr8GwG__JSey+kVABeyg@mail.gmail.com%3E
[2] https://github.com/dfilimon/mahout

Since there will be more than one patches, there will be specific JIRA issues that address each one.

The description of the code being added is:

The main classes are in o.a.m.clustering.streaming [1], under the
core/ project. These are subdivided into 2 packages:

cluster: contains the BallKMeans and StreamingKMeans classes that
can be used standalone.
BallKMeans is exactly what it sounds like (uses k-means++ for the
initialization, then does a normal k-means pass and ignoring
outilers).
StreamingKMeans implements the online clustering that doesn't return
exactly k clusters, (it returns an estimate). This is used to
approximate the data.

mapreduce: contains the CentroidWritable, StreamingKMeansDriver,
StreamingKMeansMapper and StreamingKMeansReducer classes.
CentroidWritable serializes Centroids (sort of like AbstractCluster).
StreamingKMeansDriver provides the driver for the job.
StreamingKMeansMapper runs StreamingKMeans in the mappers to produce
sketches of the data for the reducer.
StreamingKMeansReducer collects the centroids produced by the
mappers into one set of weighted points and runs BallKMeans on them
producing the final results.

Additionally the searchers are in o.a.m.math.neighborhood

neighborhood: various searcher classes that implement nearest-neighbor
search using different strategies.
Searcher, UpdatableSearcher: abstract classes that define how to
search through collections of vectors.
BruteSearch: does a brute search (looks at every point...)
ProjectionSearch: uses random projections for searching.
FastProjectionSearch: also uses random projections (but not binary
search trees as in ProjectionSearch).
HashedVector, LocalitySensitiveHashSearch: implement locality
sensitive hash search.

All the tools that I used are in o.a.m.clustering.streaming [2], under
the examples/ project.
There are a bunch of classes here, covering everything from
vectorizing 20 newsgroups data to various IO utils. The more important
ones are:
utils.ExperimentUtils: convenience methods.
tools.ClusterQuality20NewsGroups: actual experiment, with hardcoded paths.

[3] https://github.com/dfilimon/mahout/tree/skm/core/src/main/java/org/apache/mahout/clustering/streaming
[4] https://github.com/dfilimon/mahout/tree/skm/examples/src/main/java/org/apache/mahout/clustering/streaming

The relevant issues are:

~~MAHOUT-1155~~ (Centroid, WeightedVector)
~~MAHOUT-1156~~ (searchers)
~~MAHOUT-1162~~ (clustering, non map-reduce)
~~MAHOUT-1181~~ (map-reduce, command-line changes, pom.xml)

Attachments

Activity

People

Assignee:: Dan Filimon

Reporter:: Dan Filimon

Votes:: 1 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 06/Mar/13 15:19

Updated:: 31/Jan/24 22:14

Resolved:: 02/Jun/13 19:42