Details

Type: New Feature

Status: Closed

Priority: Major

Resolution: Fixed

Affects Version/s: 0.8

Fix Version/s: 0.8

Component/s: Clustering

Labels: None
Description
An implementation of Streaming KMeans as mentioned in [1] is available here [2].
[1] http://mailarchives.apache.org/mod_mbox/mahoutdev/201303.mbox/%3CCAOwb3gOyf9zufrgXHsucpkJXk6cW0Nnr8GwG__JSey+kVABeyg@mail.gmail.com%3E
[2] https://github.com/dfilimon/mahout
Since there will be more than one patch, a specific JIRA issue will address each one.
The description of the code being added is:
The main classes are in o.a.m.clustering.streaming [3], under the
core/ project. These are subdivided into two packages:
 cluster: contains the BallKMeans and StreamingKMeans classes that
can be used standalone.
BallKMeans is exactly what it sounds like (uses k-means++ for the
initialization, then does a normal k-means pass while ignoring
outliers).
StreamingKMeans implements the online clustering. It doesn't return
exactly k clusters (it returns an estimate). The result is used to
approximate the data.
 mapreduce: contains the CentroidWritable, StreamingKMeansDriver,
StreamingKMeansMapper and StreamingKMeansReducer classes.
CentroidWritable serializes Centroids (sort of like AbstractCluster).
StreamingKMeansDriver provides the driver for the job.
StreamingKMeansMapper runs StreamingKMeans in the mappers to produce
sketches of the data for the reducer.
StreamingKMeansReducer collects the centroids produced by the
mappers into one set of weighted points and runs BallKMeans on them,
producing the final results.
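
To make the mapper-side step concrete, here is a minimal sketch of a one-pass clustering that turns a stream of points into weighted centroids. It is not the StreamingKMeans class from the patch: StreamingSketch, WeightedPoint, sketch() and the fixed cutoff parameter are all hypothetical names chosen for illustration, and the sketch leaves out any cutoff adjustment or centroid collapsing that a full implementation may perform. It only shows why the number of centroids returned is an estimate rather than exactly k.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative only: none of these names come from the Mahout code base.
class StreamingSketch {
    /** A centroid plus the number of points it has absorbed (hypothetical type). */
    static final class WeightedPoint {
        final double[] coords;
        double weight;

        WeightedPoint(double[] coords, double weight) {
            this.coords = coords.clone();
            this.weight = weight;
        }
    }

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Single pass over the points. A point far from every centroid (relative
     * to the assumed cutoff) is likely to start a new centroid; otherwise it
     * is merged into the nearest one. The number of centroids returned is
     * therefore not fixed in advance.
     */
    static List<WeightedPoint> sketch(List<double[]> points, double cutoff, Random rng) {
        List<WeightedPoint> centroids = new ArrayList<>();
        for (double[] p : points) {
            WeightedPoint nearest = null;
            double best = Double.POSITIVE_INFINITY;
            for (WeightedPoint c : centroids) { // brute-force nearest centroid
                double d = distance(c.coords, p);
                if (d < best) {
                    best = d;
                    nearest = c;
                }
            }
            if (nearest == null || rng.nextDouble() < best / cutoff) {
                centroids.add(new WeightedPoint(p, 1.0));
            } else {
                // Merge: move the centroid to the weighted mean and bump its weight.
                nearest.weight += 1.0;
                for (int i = 0; i < p.length; i++) {
                    nearest.coords[i] += (p[i] - nearest.coords[i]) / nearest.weight;
                }
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        List<double[]> points = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            points.add(new double[] {i % 10, i / 10});
        }
        List<WeightedPoint> centroids = sketch(points, 3.0, new Random(42));
        System.out.println(centroids.size() + " weighted centroids");
    }
}
```

Because every merge preserves the total weight, the weights of the returned centroids always sum to the number of input points. That property is what lets a reducer-style step treat the centroids from several such passes as one set of weighted points.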
Additionally, the searchers are in o.a.m.math.neighborhood:
 neighborhood: various searcher classes that implement nearest-neighbor
search using different strategies.
Searcher, UpdatableSearcher: abstract classes that define how to
search through collections of vectors.
BruteSearch: does a brute search (looks at every point...)
ProjectionSearch: uses random projections for searching.
FastProjectionSearch: also uses random projections (but not binary
search trees as in ProjectionSearch).
HashedVector, LocalitySensitiveHashSearch: implement locality-sensitive
hash search.
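
The difference between looking at every point and searching through a random projection can be sketched as follows. This is an illustration under stated assumptions, not the code of BruteSearch or ProjectionSearch: the class NeighborSearchSketch, its method names, the single projection direction, the sorted index array used in place of a search tree, and the window parameter are all hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

// Illustrative only: none of these names come from the Mahout code base.
class NeighborSearchSketch {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    static double dot(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Exact search: compares the query against every vector. */
    static int bruteNearest(double[][] data, double[] query) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < data.length; i++) {
            double d = squaredDistance(data[i], query);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    /**
     * Approximate search: project everything onto one random direction, find
     * where the query falls in the sorted projections, and compute true
     * distances only for the vectors within `window` positions of that spot.
     */
    static int projectionNearest(double[][] data, double[] query, int window, Random rng) {
        double[] direction = new double[query.length];
        for (int i = 0; i < direction.length; i++) {
            direction[i] = rng.nextGaussian();
        }
        double[] proj = new double[data.length];
        Integer[] order = new Integer[data.length];
        for (int i = 0; i < data.length; i++) {
            proj[i] = dot(data[i], direction);
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(j -> proj[j]));

        // Binary search for the first projection that is not below the query's.
        double q = dot(query, direction);
        int lo = 0;
        int hi = order.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (proj[order[mid]] < q) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        int from = Math.max(0, lo - window);
        int to = Math.min(order.length, lo + window);
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int k = from; k < to; k++) {
            double d = squaredDistance(data[order[k]], query);
            if (d < bestDist) {
                bestDist = d;
                best = order[k];
            }
        }
        return best;
    }
}
```

With a window at least as large as the data set, the projection variant examines every vector and agrees with the brute-force result. Smaller windows trade accuracy for fewer distance computations, which is the point of the approximate searchers.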
All the tools that I used are in o.a.m.clustering.streaming [4], under
the examples/ project.
There are a bunch of classes here, covering everything from
vectorizing 20 newsgroups data to various IO utils. The more important
ones are:
utils.ExperimentUtils: convenience methods.
tools.ClusterQuality20NewsGroups: actual experiment, with hardcoded paths.
[3] https://github.com/dfilimon/mahout/tree/skm/core/src/main/java/org/apache/mahout/clustering/streaming
[4] https://github.com/dfilimon/mahout/tree/skm/examples/src/main/java/org/apache/mahout/clustering/streaming
The relevant issues are:
MAHOUT-1155 (Centroid, WeightedVector)
MAHOUT-1156 (searchers)
MAHOUT-1162 (clustering, non-mapreduce)
MAHOUT-1181 (mapreduce, command-line changes, pom.xml)
Activity
Field  Original Value  New Value 

Fix Version/s  0.8 [ 12320153 ] 
Assignee  Dan Filimon [ dfilimon ] 
Status  Open [ 1 ]  Resolved [ 5 ] 
Resolution  Fixed [ 1 ] 
Status  Resolved [ 5 ]  Closed [ 6 ] 