On Sat, May 21, 2011 at 5:47 PM, Daniel McEnnis (JIRA) <firstname.lastname@example.org> wrote:
1. Use case: This is the algorithm for those learning problems that are simply too massive even for Mahout's memory-streamlined algorithms. Particularly for knn, it's the advertising company with 50,000 classes of people, tens to hundreds of millions of examples and many terabytes of log data to classify which type of person a log belongs to. Memory footprint becomes the biggest issue as even the model takes more memory than what is available. For the other Mahout classifiers, training data size is limited to available memory on data nodes.
Actually not. This is not true of any of the other model training algorithms in Mahout, with the partial exception of the random forest. For the Naive Bayes algorithms and the SGD algorithms it is distinctly not true.
3. These distance measures have very different assumptions from those in recommendation. A missing vector entry (say in sparse vector format) means 0, not missing. This requires a hack of all distance measures to accommodate it.
I don't see why. Most of the other distance measures in Mahout use this same convention. Certainly v1.getDistanceSquared(v2) and v1.minus(v2).assign(Functions.abs).zSum() would give you results that assume 0's for missing elements.
I really think that the sub-classes of org.apache.mahout.common.distance.DistanceMeasure do just what you are saying that you want.
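To make the zero-for-missing convention concrete, here is a minimal sketch in plain Java. It deliberately does not use Mahout's Vector or DistanceMeasure classes; the class SparseDistanceSketch, its methods, and the Map-based sparse representation are all made up for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of Mahout.
public class SparseDistanceSketch {

    // Read entry i, treating an absent entry as 0.0 rather than "missing".
    public static double get(Map<Integer, Double> v, int i) {
        return v.getOrDefault(i, 0.0);
    }

    // Union of the indices that are stored in either vector.
    private static Set<Integer> indices(Map<Integer, Double> a, Map<Integer, Double> b) {
        Set<Integer> all = new HashSet<>(a.keySet());
        all.addAll(b.keySet());
        return all;
    }

    // Squared Euclidean distance over the union of stored indices.
    public static double squaredEuclidean(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        for (int i : indices(a, b)) {
            double d = get(a, i) - get(b, i);
            sum += d * d;
        }
        return sum;
    }

    // Manhattan (L1) distance over the union of stored indices.
    public static double manhattan(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        for (int i : indices(a, b)) {
            sum += Math.abs(get(a, i) - get(b, i));
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<Integer, Double> a = new HashMap<>();
        a.put(0, 1.0);
        a.put(3, 2.0);
        Map<Integer, Double> b = new HashMap<>();
        b.put(0, 1.0);
        b.put(5, 3.0);
        // Index 3 is absent from b and index 5 is absent from a; both count as 0.
        System.out.println(squaredEuclidean(a, b)); // 0 + 2*2 + 3*3 = 13.0
        System.out.println(manhattan(a, b));        // 0 + 2 + 3 = 5.0
    }
}
```

Indices stored in neither vector contribute nothing to either sum, so only the union of stored indices has to be visited.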
The measures also range from 0 to infinity, not from -1 to 1, and the smaller the better. Cosine distance doesn't fit this, so it's got a transform to map it to 0-2 where smaller is better.
My point was that cosine distance is essentially the same as Euclidean distance on length-normalized vectors. Why not just use that?
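The equivalence rests on a standard identity: for vectors scaled to unit length, the squared Euclidean distance equals 2 * (1 - cosine similarity), so ranking neighbors by either gives the same order. A minimal plain-Java sketch, assuming dense non-zero vectors and using made-up class and method names (nothing here is Mahout API):

```java
// Hypothetical illustration only; not part of Mahout.
public class CosineVsEuclideanSketch {

    public static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Scale v to unit length. Assumes v is not the zero vector.
    public static double[] normalize(double[] v) {
        double norm = Math.sqrt(dot(v, v));
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / norm;
        }
        return out;
    }

    // 1 - cosine similarity; lies in [0, 2], smaller means more similar.
    public static double cosineDistance(double[] a, double[] b) {
        return 1.0 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    public static double squaredEuclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] a = {3.0, 4.0};
        double[] b = {4.0, 3.0};
        double viaCosine = 2.0 * cosineDistance(a, b);
        double viaEuclidean = squaredEuclidean(normalize(a), normalize(b));
        // The two values agree up to floating-point rounding.
        System.out.println(viaCosine);
        System.out.println(viaEuclidean);
    }
}
```

This also accounts for the 0-2 range mentioned above: 1 - cosine similarity already lies in that interval.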
KL Distance is based on entropy. I'll double check my references for the details.
I am pretty sure that you are looking at Kullback-Leibler divergence. I think you just need to put in a Wikipedia reference. Your javadoc is not quite correct in any case.
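For reference, the textbook definition for discrete distributions is D(p || q) = sum over i of p_i * log(p_i / q_i). A minimal plain-Java sketch follows, assuming p and q are arrays of probabilities that each sum to 1; the class and method names are made up for illustration and this is not the code from the patch.

```java
// Hypothetical illustration only; not the patch's implementation.
public class KlDivergenceSketch {

    // D(p || q) in nats, using natural logarithms.
    // Terms with p[i] == 0 contribute 0; if q[i] == 0 while p[i] > 0
    // the divergence is infinite.
    public static double klDivergence(double[] p, double[] q) {
        if (p.length != q.length) {
            throw new IllegalArgumentException("p and q must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] == 0.0) {
                continue;
            }
            if (q[i] == 0.0) {
                return Double.POSITIVE_INFINITY;
            }
            sum += p[i] * Math.log(p[i] / q[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] p = {0.5, 0.5};
        double[] q = {0.9, 0.1};
        // The two directions differ, so this is not a metric.
        System.out.println(klDivergence(p, q));
        System.out.println(klDivergence(q, p));
    }
}
```

Because D(p || q) and D(q || p) generally differ, the javadoc should say which argument plays which role if this is used as a distance measure.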
5. standard classifier - Until today, I thought this was specific to the Bayes algorithm. I'll add it to the next patch.
Look at org.apache.mahout.classifier.AbstractVectorClassifier
6. usability. Any user reading the javadoc on the entry classes ModelBuilder, Classifier, or TestClassifier has instructions on how to set up data for this patch. All three should have their options explained.
That isn't what I meant. Command line documentation is all well and good, but there should be a usable API as well, especially for deployment in a working system. Very few systems can afford to run an entire map-reduce job when they just want to classify a few data points.
I'll add it to the list of things to put in the next patch. My understanding was that there is no standard in Mahout, at least for input formats. This patch describes my proposal for what input formats each Mahout component ought to be able to process.
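As a rough idea of the kind of in-process call I mean, here is a minimal sketch of a k-nearest-neighbor classifier held entirely in memory. Everything in it is hypothetical: the class InMemoryKnnSketch, its add and classify methods, and the use of dense double[] features are made up for illustration and are not taken from the patch or from Mahout.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the patch's API.
public class InMemoryKnnSketch {
    private final List<double[]> points = new ArrayList<>();
    private final List<String> labels = new ArrayList<>();

    // Store one labeled training example.
    public void add(double[] point, String label) {
        points.add(point.clone());
        labels.add(label);
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Majority vote among the k nearest stored examples.
    // On a tie, the label that reached the top count first wins.
    public String classify(double[] query, int k) {
        if (k < 1 || points.isEmpty()) {
            throw new IllegalArgumentException("need k >= 1 and at least one example");
        }
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < points.size(); i++) {
            order.add(i);
        }
        // Sort example indices by distance to the query, nearest first.
        order.sort(Comparator.comparingDouble(i -> squaredDistance(points.get(i), query)));
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (int r = 0; r < Math.min(k, order.size()); r++) {
            String label = labels.get(order.get(r));
            int count = votes.merge(label, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        InMemoryKnnSketch knn = new InMemoryKnnSketch();
        knn.add(new double[] {0.0, 0.0}, "a");
        knn.add(new double[] {0.0, 1.0}, "a");
        knn.add(new double[] {5.0, 5.0}, "b");
        knn.add(new double[] {6.0, 5.0}, "b");
        System.out.println(knn.classify(new double[] {0.2, 0.1}, 3)); // prints "a"
    }
}
```

A linear scan like this obviously does not scale to the data sizes in your use case; the point is only the shape of the call, a single method that classifies one vector without launching a job.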
If you are pushing for a standard, then that should be independent of your classifier and you should explain how that interacts with, say, the hashed vector encoding framework. See org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder