Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.1
    • Fix Version/s: 0.3
    • Component/s: Classification
    • Labels:
      None

      Description

      Please find attached a first sketch of perceptron and winnow training. Please look very, very carefully at the patch, as I wrote the heart of the algorithms in the emergency room at Charité Berlin (after I broke my leg cycling to the Hadoop Get Together).

      The patch does not yet include unit tests, nor is it parallelised. My current plan is to set up an example with the WebKB dataset, add unit tests to the code, and then go parallel. I would like to get some feedback early on; I would also feel a lot better if a second and third pair of eyes had a look at the code, to make sure all obvious mistakes are caught as early as possible.
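
For readers unfamiliar with the two algorithms named above, the following is a minimal sketch of their textbook update rules. It is not the code from the attached patch: the class name `OnlineLinearSketch`, the method names, and the use of plain `double[]` arrays instead of Mahout vectors are all assumptions made for illustration. The perceptron corrects its weights additively, while Winnow (shown here in the common "Winnow2" form with 0/1 features) corrects them multiplicatively.

```java
// Textbook update rules only. All names here are hypothetical and are
// NOT taken from the MAHOUT-85 patch.
final class OnlineLinearSketch {

  /** Dot product of two dense vectors of equal length. */
  static double dot(double[] w, double[] x) {
    double sum = 0.0;
    for (int i = 0; i < w.length; i++) {
      sum += w[i] * x[i];
    }
    return sum;
  }

  /**
   * One perceptron step. The label is +1 or -1. The weights change only
   * when the example is misclassified or lies exactly on the boundary.
   * Returns true if an update happened.
   */
  static boolean perceptronStep(double[] w, double[] x, int label, double learningRate) {
    if (label * dot(w, x) > 0) {
      return false; // correctly classified, nothing to do
    }
    for (int i = 0; i < w.length; i++) {
      w[i] += learningRate * label * x[i]; // additive correction
    }
    return true;
  }

  /**
   * One Winnow step (Winnow2 variant). Features are 0 or 1, and the
   * prediction is positive when the dot product exceeds the threshold.
   * On a mistake, the weights of the active features are multiplied by
   * alpha (missed positive) or divided by alpha (false positive).
   * Returns true if an update happened.
   */
  static boolean winnowStep(double[] w, double[] x, boolean positive,
                            double threshold, double alpha) {
    boolean predicted = dot(w, x) > threshold;
    if (predicted == positive) {
      return false; // correct prediction, nothing to do
    }
    for (int i = 0; i < w.length; i++) {
      if (x[i] != 0.0) {
        w[i] = positive ? w[i] * alpha : w[i] / alpha; // multiplicative correction
      }
    }
    return true;
  }
}
```

A sequential trainer would simply call one of these step methods for every training example, repeating passes over the data until no step reports an update or an iteration limit is reached.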

      1. MAHOUT-85.patch
        32 kB
        Isabel Drost-Fromm
      2. MAHOUT-85.patch
        32 kB
        Isabel Drost-Fromm
      3. perceptronWinnowTrainer.diff
        11 kB
        Isabel Drost-Fromm

        Activity

        Isabel Drost-Fromm added a comment -

        No, sorry. That was me committing a change that I made for MAHOUT-240 - reverted it.

        There are no Driver programs yet; this is only the sequential version. The model should be stored after training and loaded at application time. I have deferred implementing an end-to-end example to MAHOUT-241. Currently the implementation provides only the training logic.
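
As a rough illustration of "store after training, load at application time", here is a minimal sketch that writes a weight vector to bytes and reads it back. It assumes the trained model is nothing more than a dense array of weights; the class `ModelIoSketch` and its binary layout (a length followed by the raw doubles) are hypothetical and do not describe how Mahout or the patch persists models.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of the patch: round-trips a weight vector.
final class ModelIoSketch {

  /** Serialises the weights as an int length followed by each double. */
  static byte[] save(double[] weights) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(weights.length);
      for (double weight : weights) {
        out.writeDouble(weight);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Reads back a weight vector written by save(). */
  static double[] load(byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      double[] weights = new double[in.readInt()];
      for (int i = 0; i < weights.length; i++) {
        weights[i] = in.readDouble();
      }
      return weights;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In a real setup the byte array would be replaced by a file or HDFS stream; the in-memory version is used here only to keep the sketch self-contained.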

        Grant Ingersoll added a comment -

        Why is PerceptronTrainingMapper empty? Are there Driver programs for this? How do you use the model once it is trained?

        Isabel Drost-Fromm added a comment -

        Finally committed.

        Sean Owen added a comment -

        Isabel, do you think this is 'close enough' to put in the code base? If so, I can submit it. As you say, further issues could be tracked separately.

        Isabel Drost-Fromm added a comment -

        The patch now adds tests to the implementation. The additional abstraction proposed earlier is integrated. The distance measure is not configurable, but it corresponds to what was defined in the original algorithm formulations.

        The implementation is currently sequential-only. I am still evaluating whether and how it might be possible to parallelize it.

        Still missing: an example showing how to use training, how to store the resulting model and how to apply the model. That should probably be done in a new issue to keep this one focused on the algorithm itself. In addition, I still have to at least add links from our wiki to the Wikipedia pages on both algorithms.

        (Had some time left during the past few days: the screws in my knee are out now.)

        Isabel Drost-Fromm added a comment -

        I am currently adding tests. I guess I will commit once those are done and go on with a parallel version from there.

        Sean Owen added a comment -

        Sure, worth committing or shelving, you think? Just trying to review all the old issues that haven't seen activity in a year or so.

        Isabel Drost-Fromm added a comment -

        It is just a sequential version of the algorithm. No parallelisation and no Hadoop involved.

        Sean Owen added a comment -

        More housekeeping for 0.3. Is this still pretty committable? I'd go for it if you think it's basically sound.

        Isabel Drost-Fromm added a comment -

        Thanks for the comments. I will try to incorporate these in the next version of the patch.

        Karthik K added a comment -

        Would it be better to add another ctor with the distance measure as a configurable parameter (with cosine retained as the default measure)?

        Also, regarding LinearModel (member: Vector; methods: add(Vector delta), timesDelta(Vector delta)): could we have an additional abstraction, a HyperPlane (with a Vector as member and addDelta / timesDelta / distance as methods)? That might be cleaner, since theoretically we define a LinearModel to be a HyperPlane with a specific DistanceMeasure and perform classification on it, and adding to / scaling the hyperplane vector is better consolidated separately than in the LinearModel itself.
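
The proposed split could look roughly like the sketch below. Everything in it is an assumption made for illustration: `SimpleDistance` and `CosineDistance` are local stand-ins rather than Mahout's own distance classes, `double[]` replaces Mahout's Vector, `timesDelta` is read as an element-wise multiplication, and the decision rule (distance below a threshold, with 1.0 as the cosine default) is one plausible choice rather than what the patch does.

```java
// Illustrative stand-ins only; none of these types come from Mahout or the patch.
interface SimpleDistance {
  double distance(double[] a, double[] b);
}

/** Cosine distance, 1 - cos(a, b). Zero vectors get distance 1.0 by convention (an assumption). */
final class CosineDistance implements SimpleDistance {
  @Override
  public double distance(double[] a, double[] b) {
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    if (normA == 0.0 || normB == 0.0) {
      return 1.0;
    }
    return 1.0 - dot / Math.sqrt(normA * normB);
  }
}

/** Owns the hyperplane vector and the operations that modify it. */
final class HyperPlane {
  private final double[] normal;

  HyperPlane(double[] normal) {
    this.normal = normal.clone();
  }

  /** Additive update, as a perceptron would use. */
  void addDelta(double[] delta) {
    for (int i = 0; i < normal.length; i++) {
      normal[i] += delta[i];
    }
  }

  /** Element-wise multiplicative update, as Winnow would use. */
  void timesDelta(double[] delta) {
    for (int i = 0; i < normal.length; i++) {
      normal[i] *= delta[i];
    }
  }

  double distance(double[] point, SimpleDistance measure) {
    return measure.distance(normal, point);
  }

  double[] normal() {
    return normal.clone();
  }
}

/** A hyperplane plus a distance measure; classification only, no update logic. */
final class LinearModelSketch {
  private final HyperPlane plane;
  private final SimpleDistance measure;
  private final double threshold;

  /** Default constructor variant: cosine distance with threshold 1.0. */
  LinearModelSketch(HyperPlane plane) {
    this(plane, new CosineDistance(), 1.0);
  }

  /** Constructor with a configurable measure and threshold. */
  LinearModelSketch(HyperPlane plane, SimpleDistance measure, double threshold) {
    this.plane = plane;
    this.measure = measure;
    this.threshold = threshold;
  }

  /** Positive when the point is closer to the hyperplane vector than the threshold. */
  boolean classify(double[] point) {
    return plane.distance(point, measure) < threshold;
  }
}
```

With the cosine default, a distance below 1.0 is equivalent to a positive dot product with the hyperplane vector, so the default model behaves like an ordinary linear classifier without a bias term.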

        Isabel Drost-Fromm added a comment -

        The attachment mentioned above.


          People

          • Assignee:
            Isabel Drost-Fromm
            Reporter:
            Isabel Drost-Fromm
          • Votes:
            0
            Watchers:
            0
