Details

Type: Improvement

Status: Closed

Priority: Major

Resolution: Fixed

Affects Version/s: 0.1, 0.2, 0.3, 0.4

Fix Version/s: 0.6

Component/s: Classification, Clustering

Labels: None
Description
Opening this JIRA issue to collect ideas on how to streamline our classification and clustering algorithms to make integration easier for users, as per the mailing list thread http://markmail.org/message/pnzvrqpv5226twfs
Jake and Robin and I were talking the other evening, and a common lament was that our classification (and clustering) stuff was all over the map in terms of data structures. Laying that to rest and getting those components even vaguely as plug-and-play as our much more advanced recommendation components would be very, very helpful.
This issue probably also relates to MAHOUT-287 (the intention there is to make naive bayes run on vectors as input).
Ted, Jake, Robin: It would be great if one of you could add a comment on some of the issues you discussed "the other evening" and (if applicable) any minor or major changes you think could help resolve this issue.
Issue Links
 relates to

MAHOUT-228 Need sequential logistic regression implementation using SGD techniques
 Closed
Activity
Thanks for getting the ball rolling, Isabel.
More discussion from the mailing list can be found here: http://markmail.org/thread/tarn6f4ump5zn67n
One thought I remember reading somewhere: instead of using a common datatype as input for classifiers and clusterers, provide something like an InputSource interface, which defines how things are read and can be implemented for any number of physical input types (e.g. text files, serialized vectors in sequence files, various types of distributed data stores). We provide a number of input sources with the appropriate methods and allow users of Mahout to implement their own.
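Such an interface might be sketched like this (all names here are hypothetical illustrations, not existing Mahout API):

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical InputSource abstraction: each physical input type
// (text files, SequenceFiles of vectors, distributed data stores)
// implements how raw records become an iterator of vectors, here
// modeled as plain double[] for simplicity.
interface InputSource {
    Iterator<double[]> vectors();
}

// Example implementation backed by an in-memory list, standing in
// for a real file- or store-backed source a user might write.
class InMemorySource implements InputSource {
    private final List<double[]> data;

    InMemorySource(List<double[]> data) {
        this.data = data;
    }

    public Iterator<double[]> vectors() {
        return data.iterator();
    }
}
```

A learner would then consume any InputSource without caring where the vectors came from.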
Some other discussion I'm particularly interested in is how we might unify models across the classification and clustering code.
Unification of the resulting models is probably much easier than the unification of the model building process itself.
Some of the problems I have seen include:
a) all of our clustering and classification models should be able to accept vectors and produce either a "most likely" category or a vector of scores for all possible categories. Unfortunately, there is no uniform way to load a model from a file, no uniform object structure for these models, and no consistent way to call them.
b) most of our learning algorithms would be happy with vectors, but there is a pretty fundamental difference between good ways to call Hadoop-based and sequential training algorithms. The sequential stuff is traditional Java, so the interface is very easy. The parallel stuff is considerably harder to make into a really good interface. We may learn some tricks with Plume, or we may be able to use the Distributed Row Matrix, but there isn't an obvious answer.
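The sequential shape is easy to sketch in plain Java (a toy illustration, not Mahout code): the learner consumes one labeled vector at a time and updates in place. The hard part described above is that no comparably simple calling convention exists for the Hadoop-based trainers.

```java
// Toy sequential learner: keeps a running mean per category and
// classifies by smallest squared distance to a category mean.
// Real Mahout learners differ; this only shows the calling shape.
class MeanLearner {
    private final double[][] sums;
    private final int[] counts;

    MeanLearner(int numCategories, int dims) {
        sums = new double[numCategories][dims];
        counts = new int[numCategories];
    }

    // train on one labeled instance at a time
    void train(int actual, double[] x) {
        for (int i = 0; i < x.length; i++) {
            sums[actual][i] += x[i];
        }
        counts[actual]++;
    }

    // return the most likely category for an instance
    int classify(double[] x) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < sums.length; c++) {
            if (counts[c] == 0) continue;
            double d = 0;
            for (int i = 0; i < x.length; i++) {
                double diff = sums[c][i] / counts[c] - x[i];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }
}
```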
c) in some cases, the vectors are noticeably larger than the original data. This occurs when the original data is very sparse and we are looking at lots of interaction variables. Again, for sequential algorithms, this is pretty easy to deal with, but for parallel ones, it really might be better to store the original data and pass in a function that handles the vectorization on the fly.
Regarding vectorization strategies in general, I know of three that we currently use:
a) full-size vector model. Each continuous variable, each unique term for each text-like variable, and each unique interaction term is assigned a unique location in the feature vector. This requires that we know the vocabulary and the number of interaction combinations that exist. This is the strategy used by the Lucene-to-vector converter.
b) pass around the text. This is what Naive Bayes currently does. The result is kind of like a duplicated vector implementation with strings as the indices. The current implementation has difficulty with different fields containing text and with continuous variables. This avoids a dictionary-building pass, but leads to other difficulties unless we figure out a way to inject a text vectorizer.
c) feature hashing. This is what SGD supports. Here, we pick the size of the vectors a priori. Words are assigned a location based on a hash of the variable name and the word value. Continuous variables are assigned locations based on the variable name. It can be moderately difficult to reverse engineer a vector back to features since there can be ambiguity with very large feature spaces, but it isn't necessary to build a dictionary in order to make vectors.
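The hashing scheme described in (c) can be sketched in a few lines (an illustration of the idea only, not Mahout's actual encoder; collision handling and multiple probes are omitted):

```java
// Feature-hashing sketch: the vector size is fixed a priori. A word
// feature is placed by hashing (field name + word); a continuous
// feature is placed by hashing the field name alone. No dictionary
// is ever built, at the cost of possible index collisions.
class HashedVectorizer {
    private final int size;

    HashedVectorizer(int size) {
        this.size = size;
    }

    private int index(String key) {
        // floorMod keeps the index non-negative for negative hashes
        return Math.floorMod(key.hashCode(), size);
    }

    void addWord(double[] vector, String field, String word) {
        vector[index(field + ":" + word)] += 1.0;
    }

    void addContinuous(double[] vector, String field, double value) {
        vector[index(field)] += value;
    }
}
```

Reversing a vector back to features then requires remembering which keys hash where, which is exactly the ambiguity mentioned above.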
I am about to put up a patch related to MAHOUT-228 that proposes some interfaces for online learning and vector classification.
These interfaces are probably too special-purpose to be considered an answer to MAHOUT-479, but they indicate where my own work has been leading me lately.
I just moved the encoding objects associated with MAHOUT-228 to org.apache.mahout.vectors to provide a nucleus for feature encoding.
There are also a fair number of things in oam.text and oam.utils that are related. Since those are in the utils module, however, I couldn't leverage them. We may want to consider moving some of them to core to allow wider use.
I'm making some progress on integrating the clustering data structures. Here's how the top levels are shaping up:
// this defines the minimal Dirichlet model
public interface Model<O> extends Writable {
  double pdf(O x);
  void observe(O x);
  int count();
  void computeParameters();
  Model<VectorWritable> sampleFromPosterior();
}

// this puts a face on a cluster
public interface Cluster {
  int getId();
  Vector getCenter();
  Vector getRadius();
  int getNumPoints();
  String asFormatString(String[] bindings);
  String asJsonString();
}

// here's the resulting Cluster class hierarchy
public abstract class AbstractCluster implements Cluster, Model<VectorWritable> {}
public abstract class DistanceMeasureCluster extends AbstractCluster {}
public class Canopy extends DistanceMeasureCluster {}
public class Cluster extends DistanceMeasureCluster {}
public class MeanShiftCanopy extends Cluster {}
public class SoftCluster extends Cluster {}
public class GaussianCluster extends AbstractCluster {}
// Note: all the current Dirichlet models can be subsumed by GaussianCluster or simple subclasses
There's a fair amount of cleanup to tests etc. to do, but I thought I'd post this for visibility.
Jeff,
How do you think that these relate to AbstractVectorClassifier?
Does it fit reasonably as a superclass of the models?
I don't see AbstractVectorClassifier as a superclass of these models, since it needs to operate on a set of models rather than an individual model. What I see is something more like below. I realize classify() does not return the right sized vector, but how else to normalize it properly given that an arbitrary set of model pdfs won't sum to 1? Also, what about making AbstractVectorClassifier work over VectorWritables instead of Vectors? All the clustering code uses VWs.
public class VectorModelClassifier extends AbstractVectorClassifier {

  List<Model<VectorWritable>> models;

  public VectorModelClassifier(List<Model<VectorWritable>> models) {
    super();
    this.models = models;
  }

  @Override
  public Vector classify(Vector instance) {
    Vector pdfs = new DenseVector(models.size());
    int i = 0;
    for (Model<VectorWritable> model : models) {
      pdfs.set(i++, model.pdf(new VectorWritable(instance)));
    }
    return pdfs.assign(new TimesFunction(), 1.0 / pdfs.zSum());
  }

  @Override
  public double classifyScalar(Vector instance) {
    if (models.size() == 2) {
      double pdf0 = models.get(0).pdf(new VectorWritable(instance));
      double pdf1 = models.get(1).pdf(new VectorWritable(instance));
      return pdf0 / (pdf0 + pdf1);
    }
    throw new IllegalStateException();
  }

  @Override
  public int numCategories() {
    return models.size();
  }
}
Integrated in MahoutQuality #202 (See https://hudson.apache.org/hudson/job/MahoutQuality/202/)
MAHOUT-479: Refactored Model and Cluster hierarchies to include DistanceMeasure in cluster state. All unit tests run.
Integrated in MahoutQuality #208 (See https://hudson.apache.org/hudson/job/MahoutQuality/208/)
MAHOUT-479: This commit refactors Cluster to inherit from Model<VectorWritable> instead of AbstractCluster,
which now inherits just Cluster. The existing Dirichlet models also now inherit from Cluster, simplifying the use
of generics and cleaning up a lot of the code.
Since Dirichlet can now iterate over arbitrary Clusters as its models, this opens up the entire set of DistanceMeasure-based
clusters for Dirichlet processing. This should allow the output of e.g. Canopy to become the prior model
distribution in a subsequent Dirichlet step. Is this a feature?
The AbstractCluster hierarchy has been adjusted to allow GaussianClusterDistribution and
DistanceMeasureClusterDistribution to instantiate its subclasses. Both have tests and seem to work as expected.
All unit tests run. More to follow.
Integrated in MahoutQuality #213 (See https://hudson.apache.org/hudson/job/MahoutQuality/213/)
MAHOUT-479: Fixed a bug in TestVectorModelClassifier and messed around with pdf() in GaussianCluster and ASNModel. Synthetic control seems to work better on Dirichlet now, but I'm troubled by the impact of the two pdf() implementations on the outcome.
Integrated in MahoutQuality #241 (See https://hudson.apache.org/hudson/job/MahoutQuality/241/)
I did a commit recently that introduced ModelDissector. This is useful for reverse engineering feature hashed models.
The idea is that the hashed encoders have the option of keeping a trace dictionary. This tells us where each feature is hashed to, or each feature/value combination in the case of word-like values. Using this dictionary, we can put values into a synthetic feature vector in just the locations specified by a single feature or interaction. Then we can push this through a linear model to see the contribution of that input. Any generalized linear model, such as logistic regression, has a linear part that allows this.
What the ModelDissector does is accept a trace dictionary and a model in an update method. Then, in a flush method, the biggest weights are returned. This update/flush style is used so that the trace dictionary doesn't have to grow to enormous size, but instead can be cleared between updates.
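The update/flush pattern might look roughly like this (an illustrative sketch, not the actual ModelDissector API; the trace information is modeled as a plain map from feature name to model weight):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative update/flush sketch: update() folds the current trace
// dictionary's features and weights into a bounded summary, keeping
// the largest-magnitude weight seen per feature; flush() returns the
// top features by |weight|. The caller can clear its trace dictionary
// between updates instead of letting it grow without bound.
class WeightCollector {
    private final Map<String, Double> weights = new HashMap<>();

    void update(Map<String, Double> traceWeights) {
        for (Map.Entry<String, Double> e : traceWeights.entrySet()) {
            weights.merge(e.getKey(), e.getValue(),
                (a, b) -> Math.abs(a) >= Math.abs(b) ? a : b);
        }
    }

    List<String> flush(int n) {
        List<String> keys = new ArrayList<>(weights.keySet());
        keys.sort((a, b) -> Double.compare(
            Math.abs(weights.get(b)), Math.abs(weights.get(a))));
        return keys.subList(0, Math.min(n, keys.size()));
    }
}
```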
Integrated in MahoutQuality #300 (See https://hudson.apache.org/hudson/job/MahoutQuality/300/)
MAHOUT-479: Added javadoc explanations.
Moving this from neverland to 0.5. A lot has been accomplished in 0.4 in the clustering jobs, but more work is needed to converge with classification, SVD, and perhaps recommenders.
I like the idea of getting this in for 0.5, building on the work already done. Is there a reasonable "stopping point" in sight that would be a good place to leave it for 0.5? I'm supposing we'll start heading to release within a month.
Are there any near-term changes that are worth getting in for 0.5? It is looking like this won't be fully resolved for 0.5.
I think we have bits of this already in, but the grand unified theory is
still lacking.
I wouldn't expect it for 0.5.
Ted: Yeah... this is what I had in mind when I said grand unified theory.
On Wed, Apr 13, 2011 at 9:24 PM, Jeff Eastman <jdog@windwardsolutions.com>wrote:
> This could potentially collapse k-means, fuzzy-k and Dirichlet into a single implementation too:
>
> - Begin with a prior ClusterClassifier containing the appropriate sort of Cluster, in clusters-n
> - For each input Vector, compute the pdf vector using CC.classify()
> – For k-means, train the most likely model from the pdf vector
> – For Dirichlet, train the model selected by the multinomial of the pdf vector * mixture vector
> – For fuzzy-k, train each model by its normalized pdf (would need a new classify method for this)
> - Close the CC, computing all posterior model parameters
> - Serialize the CC into clusters-n+1
>
> Now that would really be cool
>
>
> On 4/13/11 9:00 PM, Jeff Eastman wrote:
>
>> Here's how I got there:
>>
>> - ClusterClassifier holds a "List<Cluster> models;" field as its only state, just like VectorModelClassifier does
>> - Started with ModelSerializerTest since you suggested being compatible with ModelSerializer
>> - This tests OnlineLogisticRegression, CrossFoldLearner and AdaptiveLogisticRegression
>> - The first two are also subclasses of AbstractVectorClassifier, just like ClusterClassifier
>> - The tests pass OLR and CFL learners to train(OnlineLearner), so it made sense for a CC to be an OL too
>> - The new CC.train(...) methods map to "models.get(actual).observe()" in Cluster.observe(V)
>> - CC.close() maps to cluster.computeParameters() for each model, which computes the posterior cluster parameters
>> - Now the CC is ready for another iteration or to classify, etc.
>>
>> So, the cluster iteration process starts with a prior List<Cluster>
>> which is used to construct the ClusterClassifier. Then in each
>> iteration each point is passed to CC.classify() and the maximum
>> probability element index in the returned Vector is used to train()
>> the CC. Since all the DistanceMeasureClusters contain their
>> appropriate DistanceMeasure, the one with the maximum pdf() is the
>> closest. Just what k-means already does, but done less efficiently (it
>> uses just the minimum distance, but pdf() = e^-distance, so the closest cluster has the largest pdf()).
>>
>> Finally, instead of passing in a List<Cluster> in the KMeansClusterer
>> I can just carry around a CC which wraps it. Instead of serializing a
>> List<Cluster> at the end of each iteration I can just serialize the
>> CC. At the beginning of the next iteration, I just deserialize it and go.
Here's an interesting piece of code. It nets out the clustering process in terms of the ClusterClassifier, which is an AbstractVectorClassifier which implements OnlineLearner and Writable. The policy for kmeans just returns the index of the max pdf element and the policy for dirichlet returns the multinomial of the mixture times the pdf. The really exciting thing to me is how, for clustering, we classify and then train whereas for classification it is the opposite order. Very symmetric and, uh, now rather obvious.
private ClusteringPolicy policy;

/**
 * Iterate over data using a prior-trained classifier, for a number of iterations
 * @param data a List<Vector> of input vectors
 * @param prior the prior-trained ClusterClassifier
 * @param numIterations the int number of iterations to perform
 * @return the posterior ClusterClassifier
 */
public ClusterClassifier iterate(List<Vector> data, ClusterClassifier prior, int numIterations) {
  for (int iteration = 1; iteration <= numIterations; iteration++) {
    for (Vector vector : data) {
      // classification yields probabilities
      Vector pdfs = prior.classify(vector);
      // policy selects a model given those probabilities
      int selected = policy.select(pdfs);
      // training causes all models to observe data
      prior.train(selected, vector);
    }
    // compute the posterior models
    prior.close();
    // update the policy
    policy.update(prior);
  }
  return prior;
}
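The ClusteringPolicy for the k-means case is easy to picture (a toy sketch over plain arrays; the real policy presumably operates on Mahout Vectors, and Dirichlet would instead sample from the multinomial of pdf * mixture):

```java
// Sketch of the k-means clustering policy described above: given the
// vector of per-cluster probabilities, select the index of the maximum
// element, i.e. train only the most likely cluster for each point.
class KMeansPolicy {
    int select(double[] pdfs) {
        int best = 0;
        for (int i = 1; i < pdfs.length; i++) {
            if (pdfs[i] > pdfs[best]) {
                best = i;
            }
        }
        return best;
    }
}
```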
Integrated in MahoutQuality #769 (See https://builds.apache.org/hudson/job/MahoutQuality/769/)
Integrated in MahoutQuality #805 (See https://builds.apache.org/hudson/job/MahoutQuality/805/)
MAHOUT-479: added a new iterate method to ClusterIterator. The method accepts 3
Hadoop Paths for input, prior and output information, plus the number of desired iterations. All algorithm data is pulled from/pushed to SequenceFiles. Added a unit test and improved the examples DisplayKMeans, DisplayFuzzyKMeans and DisplayDirichlet to use the new file-based implementation. Check out Dirichlet.
Hi Jeff,
I am running Dirichlet clustering over the Reuters data set. Due to issues with the resulting clusters, I found a problem with the function calculating the pdf in org.apache.mahout.clustering.dirichlet.models.GaussianCluster and with the change you made recently:
1. Some of the clusters have a very small radius (close to 0). This leads to UncommonDistributions.dNorm returning 0.0 when the point is at a larger distance from the mean.
2. dNorm returns a probability density, not a probability, which means that when the radius is very small and the number of dimensions of the feature vectors is very large (~50000), the pdf quickly goes to infinity.
3. When 1 and 2 both happen, the result for the pdf is NaN.
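For illustration, one common way around both problems is to compute the pdf in log space with a floor on the radius; this is a generic sketch, not a patch to GaussianCluster, and the 1e-9 floor is an arbitrary choice:

```java
// Sketch: computing a spherical Gaussian log-density instead of the
// raw density. Summing log terms over ~50000 dimensions stays finite
// where a product of per-dimension densities would underflow to 0 or
// overflow to infinity; a small floor on the radius avoids dividing
// by a (near) zero standard deviation.
class LogGaussian {
    static double logPdf(double[] x, double[] mean, double radius) {
        double sd = Math.max(radius, 1e-9); // floor tiny radii
        double logP = 0.0;
        for (int i = 0; i < x.length; i++) {
            double z = (x[i] - mean[i]) / sd;
            logP += -0.5 * z * z - Math.log(sd) - 0.5 * Math.log(2 * Math.PI);
        }
        return logP;
    }
}
```

Comparing clusters by log-pdf gives the same ordering as comparing by pdf, so callers that only need the argmax never have to exponentiate.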
Am I right that this is done? It looks so from the JIRA commits. Or at least, would it be better to continue further work in a new JIRA?
Some thoughts that come to my mind:
+ Could potentially help with MAHOUT-287; however, I need some help understanding the existing code.
I have probably forgotten dozens of other open questions; any input welcome.