Description
An implementation of Streaming KMeans as mentioned in [1] is available here [2].
[1] http://mail-archives.apache.org/mod_mbox/mahout-dev/201303.mbox/%3CCAOwb3gOyf9zufrgXHsucpkJXk6cW0Nnr8GwG__JSey+kVABeyg@mail.gmail.com%3E
[2] https://github.com/dfilimon/mahout
Since there will be more than one patch, a separate JIRA issue will address each one.
The description of the code being added is:
The main classes are in o.a.m.clustering.streaming [3], under the
core/ project. These are subdivided into 2 packages:
- cluster: contains the BallKMeans and StreamingKMeans classes that
can be used standalone.
BallKMeans is exactly what it sounds like (it uses k-means++ for the
initialization, then does a normal k-means pass while ignoring
outliers).
StreamingKMeans implements the online clustering, which doesn't return
exactly k clusters (it returns an estimate). This is used to
approximate the data.
- mapreduce: contains the CentroidWritable, StreamingKMeansDriver,
StreamingKMeansMapper and StreamingKMeansReducer classes.
CentroidWritable serializes Centroids (sort of like AbstractCluster).
StreamingKMeansDriver provides the driver for the job.
StreamingKMeansMapper runs StreamingKMeans in the mappers to produce
sketches of the data for the reducer.
StreamingKMeansReducer collects the centroids produced by the
mappers into one set of weighted points and runs BallKMeans on them
producing the final results.
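
To make the mapper/reducer split above concrete, here is a minimal, self-contained sketch. It is not the code from the patch: the class StreamingSketchDemo, the WeightedPoint type, the fixed distance cutoff and the rule "start a new centroid with probability distance / cutoff" are all assumptions chosen for illustration. The final BallKMeans pass over the collected points is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class StreamingSketchDemo {
    /** Hypothetical stand-in for a weighted centroid; not Mahout's Centroid class. */
    static final class WeightedPoint {
        final double[] center;
        double weight;

        WeightedPoint(double[] center, double weight) {
            this.center = center.clone();
            this.weight = weight;
        }

        /** Folds another weighted point into this one (weighted mean of the centers). */
        void merge(double[] other, double otherWeight) {
            double total = weight + otherWeight;
            for (int i = 0; i < center.length; i++) {
                center[i] = (center[i] * weight + other[i] * otherWeight) / total;
            }
            weight = total;
        }
    }

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** "Mapper" step: one pass over the points, producing a sketch of weighted centroids. */
    static List<WeightedPoint> sketch(List<double[]> points, double cutoff, Random rng) {
        List<WeightedPoint> centroids = new ArrayList<>();
        for (double[] p : points) {
            WeightedPoint nearest = null;
            double best = Double.POSITIVE_INFINITY;
            for (WeightedPoint c : centroids) {
                double d = distance(c.center, p);
                if (d < best) {
                    best = d;
                    nearest = c;
                }
            }
            // Assumed rule: far points tend to start a new centroid, near points are merged.
            // The very first point always starts a centroid.
            if (nearest == null || rng.nextDouble() < best / cutoff) {
                centroids.add(new WeightedPoint(p, 1.0));
            } else {
                nearest.merge(p, 1.0);
            }
        }
        return centroids;
    }

    /** "Reducer" step: gather all mapper sketches into one set of weighted points. */
    static List<WeightedPoint> collect(List<List<WeightedPoint>> sketches) {
        List<WeightedPoint> all = new ArrayList<>();
        for (List<WeightedPoint> s : sketches) {
            all.addAll(s);
        }
        return all;
    }

    static double totalWeight(List<WeightedPoint> points) {
        double sum = 0.0;
        for (WeightedPoint p : points) {
            sum += p.weight;
        }
        return sum;
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        List<double[]> split = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            split.add(new double[] {rng.nextGaussian(), rng.nextGaussian()});
        }
        List<List<WeightedPoint>> sketches = new ArrayList<>();
        sketches.add(sketch(split, 3.0, rng));
        List<WeightedPoint> all = collect(sketches);
        System.out.println(all.size() + " centroids, total weight " + totalWeight(all));
    }
}
```

The number of centroids in a sketch depends on the data and the cutoff, so it is not fixed in advance, but the total weight always equals the number of input points. That is what lets a later weighted k-means pass treat the sketch as a stand-in for the original data.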
Additionally, the searchers are in o.a.m.math.neighborhood:
- neighborhood: various searcher classes that implement nearest-neighbor
search using different strategies.
Searcher, UpdatableSearcher: abstract classes that define how to
search through collections of vectors.
BruteSearch: does a brute-force search (looks at every point...)
ProjectionSearch: uses random projections for searching.
FastProjectionSearch: also uses random projections (but not binary
search trees as in ProjectionSearch).
HashedVector, LocalitySensitiveHashSearch: implement locality
sensitive hash search.
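
As a rough illustration of the difference between the brute-force and projection-based strategies listed above, here is a small standalone sketch. It does not use the Searcher API from the patch: SearchSketch, ProjectionIndex, the single random direction and the window parameter are all hypothetical names and simplifications. The real classes may use several projections and different data structures.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

class SearchSketch {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Brute-force search: looks at every point and returns the index of the closest. */
    static int bruteNearest(double[][] data, double[] query) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < data.length; i++) {
            double d = squaredDistance(data[i], query);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }

    /** Projection search with one random direction (an assumption made for brevity). */
    static final class ProjectionIndex {
        final double[][] data;
        final double[] direction;
        final Integer[] order;            // point indices sorted by projection value
        final double[] sortedProjections; // projection values in the same order

        ProjectionIndex(double[][] points, Random rng) {
            double[] dir = new double[points[0].length];
            for (int i = 0; i < dir.length; i++) {
                dir[i] = rng.nextGaussian();
            }
            Integer[] ord = new Integer[points.length];
            for (int i = 0; i < ord.length; i++) {
                ord[i] = i;
            }
            Arrays.sort(ord, Comparator.comparingDouble(idx -> dot(points[idx], dir)));
            double[] proj = new double[points.length];
            for (int i = 0; i < proj.length; i++) {
                proj[i] = dot(points[ord[i]], dir);
            }
            this.data = points;
            this.direction = dir;
            this.order = ord;
            this.sortedProjections = proj;
        }

        /** Checks only the points whose projections lie within `window` ranks of the query's. */
        int nearest(double[] query, int window) {
            int pos = Arrays.binarySearch(sortedProjections, dot(query, direction));
            if (pos < 0) {
                pos = -pos - 1; // convert "not found" result to the insertion point
            }
            int lo = Math.max(0, pos - window);
            int hi = Math.min(data.length, pos + window + 1);
            int best = -1;
            double bestDistance = Double.POSITIVE_INFINITY;
            for (int j = lo; j < hi; j++) {
                int idx = order[j];
                double d = squaredDistance(data[idx], query);
                if (d < bestDistance) {
                    bestDistance = d;
                    best = idx;
                }
            }
            return best;
        }
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {1, 1}, {5, 5}, {9, 9}};
        double[] query = {4.6, 5.2};
        ProjectionIndex index = new ProjectionIndex(data, new Random(42));
        System.out.println("brute: " + bruteNearest(data, query));
        System.out.println("projection: " + index.nearest(query, 1));
    }
}
```

With a small window the projection search inspects only a few candidates and may miss the true nearest neighbor. With a window that covers the whole data set it degenerates into the brute-force scan and returns the same answer.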
All the tools that I used are in o.a.m.clustering.streaming [4], under
the examples/ project.
There are a bunch of classes here, covering everything from
vectorizing 20 newsgroups data to various IO utils. The more important
ones are:
utils.ExperimentUtils: convenience methods.
tools.ClusterQuality20NewsGroups: actual experiment, with hardcoded paths.
[3] https://github.com/dfilimon/mahout/tree/skm/core/src/main/java/org/apache/mahout/clustering/streaming
[4] https://github.com/dfilimon/mahout/tree/skm/examples/src/main/java/org/apache/mahout/clustering/streaming
The relevant issues are:
- MAHOUT-1155 (Centroid, WeightedVector)
- MAHOUT-1156 (searchers)
- MAHOUT-1162 (clustering, non map-reduce)
- MAHOUT-1181 (map-reduce, command-line changes, pom.xml)