Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 0.5
    • Fix Version/s: None
    • Component/s: Classification
    • Labels:

      Description

      Implement boosting (a GradBoost variant) with L1 regularization and induction.
      The gradient part is scalable and parallel, and the induction part allows stochastic hypothesis generation for speed.

      1. MAHOUT-716.patch (25 kB, Hector Yee)
      2. MAHOUT-716.patch (24 kB, Sean Owen)

        Activity

        Hector Yee added a comment -

        Implement boosting with decision stumps trained by GradBoost. For technical details, see "Boosting with Structural Sparsity" by Duchi and Singer.
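
        To make the idea concrete, here is a minimal sketch of boosting over decision stumps with an L1 penalty. It is not code from the patch and not the GradBoost update from the paper: it assumes a fixed, pre-chosen set of stumps (no induction step), logistic loss, and a plain gradient step followed by soft-thresholding. The class and method names (`L1BoostSketch`, `epoch`, `softThreshold`) are hypothetical.

```java
// Illustrative only: a generic L1-regularized coordinate update over fixed stumps.
// Assumptions: logistic loss, labels in {-1, +1}, stumps chosen in advance.
final class L1BoostSketch {
  private L1BoostSketch() {}

  // A decision stump votes +1 if the feature exceeds the threshold, else -1.
  static double stump(double[] x, int feature, double threshold) {
    return x[feature] > threshold ? 1.0 : -1.0;
  }

  // Soft-thresholding shrinks a weight toward zero and clips small weights to exactly 0.0.
  static double softThreshold(double w, double lambda) {
    if (w > lambda) {
      return w - lambda;
    }
    if (w < -lambda) {
      return w + lambda;
    }
    return 0.0;
  }

  // Ensemble score: the weighted sum of stump votes.
  static double score(double[] x, int[] features, double[] thresholds, double[] weights) {
    double sum = 0.0;
    for (int j = 0; j < weights.length; j++) {
      sum += weights[j] * stump(x, features[j], thresholds[j]);
    }
    return sum;
  }

  // One pass over all stumps: gradient step on the mean logistic loss, then shrink.
  static void epoch(double[][] xs, int[] ys, int[] features, double[] thresholds,
                    double[] weights, double stepSize, double lambda) {
    for (int j = 0; j < weights.length; j++) {
      double grad = 0.0;
      for (int i = 0; i < xs.length; i++) {
        double margin = ys[i] * score(xs[i], features, thresholds, weights);
        double vote = stump(xs[i], features[j], thresholds[j]);
        // d/dw_j of log(1 + exp(-y * f(x))) is -y * h_j(x) / (1 + exp(y * f(x)))
        grad += -ys[i] * vote / (1.0 + Math.exp(margin));
      }
      grad /= xs.length;
      weights[j] = softThreshold(weights[j] - stepSize * grad, stepSize * lambda);
    }
  }
}
```

        In this sketch the inner sum over examples is the part that could be computed in parallel, and the soft-threshold step is what leaves the weights of uninformative stumps at exactly zero.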

        Hector Yee added a comment -

        Any feedback on this? I am planning to build more on top of this that is specific to AdaBoost, such as a debugging tool for the stumps, tools to plot PR curves, etc.

        Sean Owen added a comment -

        I'll let Ted comment on the substance, but here is some small-scale feedback. This patch definitely needs a fair bit more scrubbing, and I will post my version of it.

        • java.util.Vector is so 2000. Definitely use List/ArrayList instead.
        • You definitely don't want to build strings with new String() and concatenation; try StringBuilder.
        • equals() must take Object as its parameter or it is not actually overriding anything. If you use @Override, the compiler will flag this sort of error.
        • You absolutely must implement hashCode() when implementing equals().
        • Use the Java 5 foreach loop where you can.
        • Avoid non-private fields; use protected getters if you want.
        • 0 is an int and 0.0 is a double. Java converts silently, but I personally prefer being explicit about intent by using 0.0 where a double value is meant.
        • There are a load of unused imports, including weird ones like the JavacCompiler.
        • Don't catch Exception.
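
        A minimal sketch of several of the points above. The class names (`Stump`, `StumpFormatter`) and their fields are hypothetical and not taken from the patch; `Objects.hash` assumes Java 7 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical value class: private fields with getters, equals(Object) plus hashCode().
final class Stump {
  private final int featureIndex;
  private final double threshold;

  Stump(int featureIndex, double threshold) {
    this.featureIndex = featureIndex;
    this.threshold = threshold;
  }

  int getFeatureIndex() {
    return featureIndex;
  }

  double getThreshold() {
    return threshold;
  }

  // @Override makes the compiler reject a mistaken equals(Stump) overload.
  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof Stump)) {
      return false;
    }
    Stump that = (Stump) other;
    return featureIndex == that.featureIndex
        && Double.compare(threshold, that.threshold) == 0;
  }

  // Equal objects must produce equal hash codes, so hash the same fields equals() compares.
  @Override
  public int hashCode() {
    return Objects.hash(featureIndex, threshold);
  }
}

final class StumpFormatter {
  private StumpFormatter() {}

  // List instead of java.util.Vector, foreach instead of index loops,
  // StringBuilder instead of repeated String concatenation.
  static String describe(List<Stump> stumps) {
    StringBuilder sb = new StringBuilder();
    double thresholdSum = 0.0; // 0.0, not 0, because a double is meant
    for (Stump s : stumps) {
      sb.append('f').append(s.getFeatureIndex())
          .append('<').append(s.getThreshold()).append(';');
      thresholdSum += s.getThreshold();
    }
    sb.append("sum=").append(thresholdSum);
    return sb.toString();
  }

  // Catch the specific exception that can occur here rather than Exception.
  static double parseThreshold(String text, double fallback) {
    try {
      return Double.parseDouble(text);
    } catch (NumberFormatException e) {
      return fallback;
    }
  }
}
```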

        I strongly encourage anyone to just go get the free version of IntelliJ. Turn on every one of its code inspection settings. Every source file will have something like 100 things flagged. Slowly turn off the rules you don't like or that don't apply. You will be left with a rule set that flags all of this stuff instantly as you look at a file. It's like night and day to have all this stuff jump out at you and be fixable with one click.

        (I am happy to share my personal ruleset, which is pretty standard.)

        Hector Yee added a comment -

        Thanks for cleaning it up! I'm just starting to re-learn Java, so this is neat!

        Ted Dunning added a comment -

        Hector, I can't get to this right now because of other issues (traveling, at a conference and so on).

        Don't let me slow you down. Clone a git repo and go forward assuming success. We will fix up the patches
        progressively and you can rebase later developments against trunk as things go along.

        You should be able to make progress at full speed this way, and the only cost will be a slight delay in adoption
        of improvements by others. The uptake of new methods is bounded by social factors anyway and will not happen
        on the same time scale.

        Hector Yee added a comment -

        Yeah, I forked a Git repo on GitHub; it should be much easier to manage than Subversion.

        Hector Yee added a comment -

        Any news on this patch?

        Sean Owen added a comment -

        When we left off, I had sent over my version of the patch. Do you have a further update?
        I think it's down to Ted being able to take a look, then.

        Hector Yee added a comment -

        Nope, waiting on Ted's feedback.

        Sean Owen added a comment -

        Not sure what has happened here. Ted, was it waiting on you?

        Isabel Drost-Fromm added a comment -

        After not much activity, I took a brief look at the patch. Some comments (to be taken with a grain of salt, since I haven't had the cycles to follow the project as closely as I would have liked in the past months):

        You mentioned a forked Git repo on GitHub. Is it still online?

        So far this looks like a rather isolated change. Would it make sense to integrate it with the existing classification APIs, e.g. org.apache.mahout.classifier.AbstractVectorClassifier?

        Also, some more documentation and a usage example for the uninitiated would be great. In addition to links to the one or two publications the implementation is based on, it's always great to have some information on the strengths and weaknesses of the implemented solution (yes, I know we are doing pretty badly along these lines with other bits and pieces we have, but it would still be nice to have).

        Hector Yee added a comment -

        Thanks for the review, Isabel.

        The Git repo used to be at https://github.com/klout/mahout_delete

        The paper it is based on is here: http://www.cs.berkeley.edu/~jduchi/projects/DuchiSi09_boost.html

        I'm hesitant to make changes to it, as I would now have to get Google's approval to re-submit.

        Sebastian Schelter added a comment -

        Moving this to the backlog. Hector Yee, if you find time to address the comments, it would be great to have this in a future release.


          People

          • Assignee:
            Ted Dunning
            Reporter:
            Hector Yee
          • Votes:
            0
            Watchers:
            2

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated: 72h
              Remaining: 72h
              Logged: Not Specified
