MAHOUT-668: Adding knn support to Mahout classifiers

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.6
    • Fix Version/s: None
    • Component/s: Classification

      Description

      Initial implementation of knn. This is a minimal base set with many more possible add-ons, including support for text and Weka input as well as a classify-only (no confusion matrix) back end. The system was tested on the 20 Newsgroups data set.

      Attachments

      1. Mahout-668-3.patch
        159 kB
        Daniel McEnnis
      2. Mahout-668-3.patch
        167 kB
        Daniel McEnnis
      3. Mahout-668-3.patch
        167 kB
        Daniel McEnnis
      4. Mahout-668-3.patch
        175 kB
        Daniel McEnnis
      5. Mahout-668-3.patch
        20 kB
        Daniel McEnnis
      6. Mahout-668-3.patch
        175 kB
        Daniel McEnnis
      7. Mahout-668-3.patch
        190 kB
        Daniel McEnnis
      8. Mahout-668-2.patch
        20 kB
        Daniel McEnnis
      9. MAHOUT-668.pat
        20 kB
        Daniel McEnnis
      10. Mahout-668.pat
        18 kB
        Daniel McEnnis

        Activity

        Daniel McEnnis added a comment -

        Initial release of knn. Needs to have more functionality added and possibly a refactoring.

        Daniel McEnnis added a comment -

        My development environment is not on the SVN checkout. One of the files was not copied over.

        Daniel McEnnis added a comment -

        I really think we should do knn. I have attached a 29-class patch implementing this.

        As originally designed, it is O(1) in memory usage, scales up to models the size of a data node's hard disk, and has implementations for log files, text files, tokenized text files, and Weka ARFF files.

        Daniel McEnnis added a comment -

        It helps if I attach the right patch.

        Ted Dunning added a comment -

        Style comment: Please add javadocs and remove @author tags

        Real comment: Metrics like CityBlock already exist in Mahout. If you need to re-implement them, you probably will have better results if you use matrix/vector operations instead of explicit loops. This is especially true when the assumptions that led you to your loop structure are violated.
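
        (Editorial aside: a minimal sketch of the vector-operation style Ted suggests. Class and method names follow the Mahout 0.x math API, e.g. DenseVector, Functions.ABS, zSum(), and should be verified against the tree.)

{code:java}
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.function.Functions;

// City-block (L1) distance expressed as vector operations instead of
// an explicit element loop. Sparse vectors treat missing entries as 0,
// so no special-casing is needed.
public class CityBlockSketch {
  public static double cityBlock(Vector a, Vector b) {
    return a.minus(b).assign(Functions.ABS).zSum();
  }

  public static void main(String[] args) {
    Vector a = new DenseVector(new double[] {0, 1, 0});
    Vector b = new DenseVector(new double[] {1, 1, 1});
    System.out.println(cityBlock(a, b)); // prints 2.0
  }
}
{code}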

        Ted Dunning added a comment -

        More style: Please make instance variables private wherever possible. Also, weaken types where possible. Don't use a HashMap if you really only care about it being a Map. Do use the Mahout standard indentation.

        Content: Why did you not use the Dictionary class to manage a set of id's?

        Content: Isn't there already a cosine distance available? Likewise, isn't there a distance class available?

        Content: Your dot product distance is a little odd. Why use 1/(a \dot b)? Why not use - (a \dot b)? Are you assuming a and b are normalized? If so, why not use euclidean distance (aka L2)?
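
        (Editorial note: if a and b are unit-normalized, then \|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2 (a \dot b) = 2 - 2 (a \dot b), so ranking neighbors by - (a \dot b) is equivalent to ranking them by Euclidean distance; that identity is what Ted's question turns on.)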

        Style: Can you provide a Wikipedia link for your definition of KL distances? Do you mean KL as Kullback-Leibler or Karhunen-Loève?

        Style: Don't comment out code that you want to delete. Just delete it.

        Content: I think that your use of State variables is confused especially in, for instance, the MasterVector class. Why do you do it? What does it really mean? What is the advantage?

        Style: Isn't there a better name for MasterVector? How about IdfCountDictionary or something?

        Style: Spell check your javadoc

        Style: The javadoc on TestClassifier seems very confusing.

        Overall, I think that this needs a lot of work.

        Ted Dunning added a comment -

        What is the overall use case here? It seems like this is a not very motivated collection of command line programs that have arbitrary choices of distances and methods.

        Why doesn't it fit into the standard classifier API more?

        Is there any roadmap document that would describe how to use these classifiers? Could a Mahout user of some other kind of classifier guess how to use these classes?

        What I would much rather see is something that works on Vectors and which has a well-defined on-disk format for a model. Then it would be nice to have good and fast parallel and sequential training code. The sequential training code should emulate on-line training and implement the standard APIs. You should allow old state to be updated and then written back to disk with the close method. Deployment should be possible analogously to the way that the LogisticRegression stuff does it. The ModelSerializer should be able to load and save this kind of model. It would be very fine if the model itself were a Writable.

        Daniel McEnnis added a comment -

        Thank you, Ted, for putting so much time into this. I'll do my best to answer as concisely and completely as possible.

        1. Use case: This is the algorithm for those learning problems that are simply too massive even for Mahout's memory-streamlined algorithms. Particularly for knn, it's the advertising company with 50,000 classes of people, tens to hundreds of millions of examples, and many terabytes of log data, classifying which type of person a log belongs to. Memory footprint becomes the biggest issue, as even the model takes more memory than what is available. For the other Mahout classifiers, training data size is limited to available memory on data nodes.

        2. I forgot to add javadoc to the test classes. I'll fix that for the next patch.

        3. These distance measures have very different assumptions from those in recommendation. A missing vector entry (say, in sparse vector format) means 0, not missing. This requires a hack of all distance measures to accommodate it. The measures also range from 0 to infinity, not -1 to 1, and smaller is better. Cosine distance doesn't fit this, so it's got a transform to map it to 0-2 where smaller is better (a sketch of that transform follows this list). KL Distance is based on entropy. I'll double check my references for the details.

        4. MasterVector and ClassLabelVector - I created my own Dictionary class because of my difficulty understanding it. I'm willing to switch; it just means taking more time to understand the code. The name is arbitrary. I can change it as needed. DfCountDictionary works better for me as it's not an inverted reference.

        5. standard classifier - Until today, I thought this was specific to the Bayes algorithm. I'll add it to the next patch.

        6. usability. Any user reading the javadoc on the entry classes ModelBuilder, Classifier, or TestClassifier has instructions on how to set up data for this patch. All three should have their options explained. I'll add it to the list of things to put in the next patch. My understanding was that there is no standard for at least input formats in Mahout. This patch describes my proposal for what input formats each Mahout component ought to be able to process.

        7. still working on model suggestions....
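
        (Editorial sketch of the transform mentioned in point 3, assuming it is the usual one: d(a, b) = 1 - cos(a, b). Since cos(a, b) ranges over -1 to 1, d ranges over 0 to 2, with smaller meaning more similar; this matches the convention of Mahout's CosineDistanceMeasure.)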

        Daniel McEnnis added a comment -

        Ted,

        You're right. The distance metrics will have trouble with Random Vectors. I'll work on a fix for that. (The code is on the critical path; I can't afford to lose the speed of the current method, and the other vector methods give incorrect results for missing=0 vectors.)

        Daniel.

        Ted Dunning added a comment -

        You're right. The distance metrics will have trouble with Random Vectors. I'll work on a fix for that. (The code is on the critical path; I can't afford to lose the speed of the current method, and the other vector methods give incorrect results for missing=0 vectors.)

        Sparse vectors in Mahout assume that missing elements are 0.

        Are you saying that you want to consider missing elements as something other than 0? Your javadoc didn't seem to say that.

        You should get the same results either way.

        Ted Dunning added a comment -

        On Sat, May 21, 2011 at 5:47 PM, Daniel McEnnis (JIRA) <jira@apache.org> wrote:
        1. Use case: This is the algorithm for those learning problems that are simply too massive even for Mahout's memory-streamlined algorithms. Particularly for knn, it's the advertising company with 50,000 classes of people, tens to hundreds of millions of examples, and many terabytes of log data, classifying which type of person a log belongs to. Memory footprint becomes the biggest issue, as even the model takes more memory than what is available. For the other Mahout classifiers, training data size is limited to available memory on data nodes.

        Actually not. In fact, this is not true for any of the other model training algorithms in Mahout except kind of sort of, but not really, for the random forest. For the Naive Bayes algorithms and the SGD algorithms it is distinctly not true.

        3. These distance measures have very different assumptions from those in recommendation. A missing vector entry (say in sparse vector format) means 0, not missing. This requires a hack of all distance measures to accommodate it.

        I don't see why. Most of the other distance measures in Mahout use this same convention. Certainly v1.getDifferenceSquared and v1.minus(v2).assign(Functions.abs).sum() would give you results that assume 0's for missing elements.

        I really think that the sub-classes of org.apache.mahout.common.distance.DistanceMeasure do just what you are saying that you want.
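
        (Editorial aside: a minimal usage sketch of the existing DistanceMeasure classes Ted points to; class names are from the Mahout 0.x org.apache.mahout.common.distance package and should be verified.)

{code:java}
import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.distance.ManhattanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

// The stock measures already follow the missing-entry-means-0
// convention for sparse vectors.
public class DistanceMeasureSketch {
  public static void main(String[] args) {
    Vector a = new DenseVector(new double[] {0, 1, 0});
    Vector b = new DenseVector(new double[] {1, 1, 1});
    DistanceMeasure manhattan = new ManhattanDistanceMeasure();
    DistanceMeasure cosine = new CosineDistanceMeasure();
    System.out.println(manhattan.distance(a, b)); // 2.0
    System.out.println(cosine.distance(a, b));    // 1 - cos(a, b), about 0.42
  }
}
{code}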

        The measures also range from 0 to infinity, not -1 to 1, and smaller is better. Cosine distance doesn't fit this, so it's got a transform to map it to 0-2 where smaller is better.

        My point was that cosine distance is essentially the same as Euclidean distance. Why not just use that?

        KL Distance is based on entropy. I'll double check my references for the details.

        I am pretty sure that you are looking at Kullback-Leibler divergence. I think you just need to put in a Wikipedia reference. Your javadoc is not quite correct in any case.
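
        (For reference, the standard definition: D_KL(P \| Q) = \sum_i p_i \log (p_i / q_i), the expected extra message length incurred by coding samples from P with a code optimized for Q.)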

        5. standard classifier - Until today, I thought this was specific to the Bayes algorithm. I'll add it to the next patch.

        Look at org.apache.mahout.classifier.AbstractVectorClassifier
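
        (Editorial aside: a minimal sketch of what plugging into that API looks like. The abstract method set is recalled from the Mahout 0.x AbstractVectorClassifier; the kNN internals are hypothetical placeholders.)

{code:java}
import org.apache.mahout.classifier.AbstractVectorClassifier;
import org.apache.mahout.math.Vector;

// Skeleton of a kNN classifier behind the standard classifier API.
public class KnnClassifier extends AbstractVectorClassifier {
  private final int numCategories;

  public KnnClassifier(int numCategories) {
    this.numCategories = numCategories;
  }

  @Override
  public int numCategories() {
    return numCategories;
  }

  @Override
  public Vector classify(Vector instance) {
    // Contract: return scores for categories 1..n-1; the score for
    // category 0 is implied by the remainder.
    throw new UnsupportedOperationException("kNN vote counting goes here");
  }

  @Override
  public double classifyScalar(Vector instance) {
    // Two-category convenience method: the score for category 1.
    throw new UnsupportedOperationException("kNN vote counting goes here");
  }
}
{code}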

        6. usability. Any user reading the javadoc on the entry classes ModelBuilder, Classifier, or TestClassifier has instructions on how to set up data for this patch. All three should have their options explained.

        That isn't what I meant. Command line documentation is all well and good, but there should be a usable API as well, especially for deployment in a working system. Very few systems can afford to do an entire map-reduce when they just want to classify a few data points.

        I'll add it to the list of things to put in the next patch. My understanding was that there is no standard for at least input formats in Mahout. This patch describes my proposal for what input formats each Mahout component ought to be able to process.

        If you are pushing for a standard, then that should be independent of your classifier and you should explain how that interacts with, say, the hashed vector encoding framework. See org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder
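
        (Editorial aside: a minimal sketch of the hashed encoding framework Ted references; the encoder API is recalled from Mahout 0.x and should be verified.)

{code:java}
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

// Features are hashed into a fixed-width vector, so no global
// dictionary has to be built or shipped alongside the model.
public class EncodingSketch {
  public static void main(String[] args) {
    Vector v = new RandomAccessSparseVector(1000);
    StaticWordValueEncoder encoder = new StaticWordValueEncoder("body");
    for (String token : "the quick brown fox".split(" ")) {
      encoder.addToVector(token, v);
    }
    System.out.println(v.getNumNondefaultElements()); // distinct hashed slots used
  }
}
{code}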

        Daniel McEnnis added a comment -

        I've fixed Javadoc and AbstractVectorClassifier API compliance in this patch.

        Daniel McEnnis added a comment -

        Ted,

        Cosine distance is quite different from Euclidean distance. In Euclidean, the size of the file drives the distance metric; in Cosine, the angle between the two file vectors is the only measure taken. Also, here is a difference between my version of the Euclidean metric and the one already present:

        present: {NaN, 1, NaN} x {1, 1, 1} = 0.0 distance
        new: {0, 1, 0} x {1, 1, 1} = 1.47 distance

        Daniel.

        Daniel McEnnis added a comment -

        Added Apache License declaration to VectorClassifier.java

        Daniel McEnnis added a comment -

        As requested, MasterVector->DfCounter, javadoc upgrade on KLDistance, and additional code to handle random access vectors in distance metrics.

        Daniel McEnnis added a comment -

        I've implemented a parallel training option. I've held off on the model update because I cannot find an interface in any of the other classifiers. Ted, can you point me to the right interface for model update? I seem to have implemented everything else holding up the commit.

        Daniel McEnnis added a comment -

        I created the patch from the wrong tree.

        Ted Dunning added a comment -

        Daniel,

        What exactly do you mean by model update?

        My confusion stems from the fact that online algorithms have nothing but update. Just load the model and start calling train with new data. Then save it.

        That seems simple enough that I suspect I don't understand the question.

        Daniel McEnnis added a comment -

        Ted,

        On the contrary. The only method of creating a model in Bayes I've found uses tokenized text with no way to specify a model. If specifying a model when loading code is the interface, then great, but none of the classifiers have it. It has some serious issues with text versus non-text models, but I can live with that. But what you're describing is loading a model, which AFAIK Bayes doesn't have. If this is not true, please point me at the code so I can study it. Thank you,

        Daniel.

        Ted Dunning added a comment -

        So sorry... I didn't look enough at context.

        I was referring to the things that inherit from OnlineLearner.

        You are correct that Bayes isn't online or updatable as it stands. However, that isn't necessary since all that Bayes depends on are counts which are easily merged.

        That will require a bit of spelunking to figure out where the counts are stored and how to merge them.
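
        (For reference, the OnlineLearner contract under discussion, reproduced from memory of the Mahout 0.x org.apache.mahout.classifier package; verify against the source.)

{code:java}
import org.apache.mahout.math.Vector;

// "Nothing but update": load a model, call train(...) once per new
// example, then close() to finish training before saving the model.
public interface OnlineLearner {
  void train(int actual, Vector instance);
  void train(long trackingKey, int actual, Vector instance);
  void train(long trackingKey, String groupKey, int actual, Vector instance);
  void close();
}
{code}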

        Daniel McEnnis added a comment -

        OnlineLearner interface is now implemented. Is there anything else I need to do before a decision on whether or not this patch will be accepted is made?

        Sean Owen added a comment -

        Ted, what's the status here - are you comfortable with committing? And does this patch need updating in order to be good for inclusion?

        Ted Dunning added a comment -

        I have been traveling a lot and haven't had a chance to look at this again. I will try again shortly.

        Sean Owen added a comment -

        I assume this all timed out – no work in over a year. A shame, but I'm going to close it unless anyone can resurrect it with a new patch.

        Philip Nadeau added a comment -

        Is this still abandoned? Is it worth picking up for the sake of making a contribution?

        Ted Dunning added a comment -

        Contributions without user demand aren't necessarily a good thing.

        I still have some qualms about this code fitting in and about performance.

        See also https://github.com/tdunning/knn for a very different take on this problem that focuses on query performance, but which ignores the issues of flexibility of metric.

        Raimon Bosch added a comment -

        Could you also include the changes to classify-20newsgroups.sh in the patch?

        Sebastian Schelter added a comment -

        Moving this to the backlog

        Dan Filimon added a comment -

        Perhaps this should be looked at soon, especially since parts of Ted's knn repo will soon start landing in Mahout.
        Notably, there is an open issue for a bunch of nearest-neighbor searchers: https://issues.apache.org/jira/browse/MAHOUT-1156

        Ted Dunning added a comment -

        I don't think so. The searcher interface makes MAHOUT-668 kind of superfluous.

        Dan Filimon added a comment -

        Well, first off, yes, it makes the nearest-neighbors part superfluous. By the same token, there is a NearestNUserNeighborhood class in o.a.m.cf.taste.impl.neighborhood that could probably be replaced.

        But what I mean is that the bigger picture is using nearest neighbors for classification in some principled way, isn't it?
        Ted, you actually asked me to test that: building vectors of distances from a point to each cluster, then applying the e^(-d^2) transform and logistic regression, is like using radial basis functions followed by logistic regression.

        Wouldn't it be useful to have code in Mahout that does this directly rather than going through the entire process manually?

        Now, I don't know whether this particular patch can easily be adapted to use whatever code Mahout now has (it might be that the code has sadly rotted). But feature-wise, it seems useful.
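
        (Editorial note on the transform Dan describes, in standard form: for a point x and cluster centroids c_1, ..., c_k, build features phi_j(x) = e^{-\|x - c_j\|^2}, j = 1, ..., k, and feed them to logistic regression; this is a radial basis function network with the cluster centroids as basis centers.)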


          People

          • Assignee: Unassigned
          • Reporter: Daniel McEnnis
          • Votes: 0
          • Watchers: 8


          Time Tracking

          • Original Estimate: 672h
          • Remaining Estimate: 672h
          • Time Spent: Not Specified
