MAHOUT-943: Improve the way the split point is chosen in DF.

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Classification
    • Labels:

      Description

      The numericalSplit() method of OptIgSplit uses the attribute value with the best information gain (IG) as the split point.

      I think this is a little too strict. In some situations it would be better to use the average of the best-IG value and the second value.
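
For illustration, a minimal sketch of the proposal, not the code in the attached patch. It assumes that "the 2nd value" means the nearest distinct attribute value below the best-IG value, that the labels are binary, and that a split sends values below the threshold to the low side. The class and method names (MidpointSplitSketch, bestThreshold, midpointThreshold) are hypothetical.

```java
import java.util.Arrays;

final class MidpointSplitSketch {

  /** Entropy in bits of a two-class count pair. */
  static double entropy(int pos, int neg) {
    int n = pos + neg;
    double h = 0.0;
    for (int c : new int[] {pos, neg}) {
      if (c > 0) {
        double p = (double) c / n;
        h -= p * Math.log(p) / Math.log(2);
      }
    }
    return h;
  }

  /** Information gain of splitting into (value < threshold) and (value >= threshold). */
  static double infoGain(double[] values, boolean[] labels, double threshold) {
    int loPos = 0, loNeg = 0, hiPos = 0, hiNeg = 0;
    for (int i = 0; i < values.length; i++) {
      if (values[i] < threshold) {
        if (labels[i]) loPos++; else loNeg++;
      } else {
        if (labels[i]) hiPos++; else hiNeg++;
      }
    }
    double n = values.length;
    double parent = entropy(loPos + hiPos, loNeg + hiNeg);
    return parent
        - ((loPos + loNeg) / n) * entropy(loPos, loNeg)
        - ((hiPos + hiNeg) / n) * entropy(hiPos, hiNeg);
  }

  /** Behaviour described in the issue: the observed value with the best IG. */
  static double bestThreshold(double[] values, boolean[] labels) {
    double[] sorted = Arrays.stream(values).distinct().sorted().toArray();
    double best = sorted[0];
    double bestIg = -1.0;
    for (double v : sorted) {
      double ig = infoGain(values, labels, v);
      if (ig > bestIg) {
        bestIg = ig;
        best = v;
      }
    }
    return best;
  }

  /** Proposed variant (assumption): average the best value with the next lower distinct value. */
  static double midpointThreshold(double[] values, boolean[] labels) {
    double best = bestThreshold(values, labels);
    double[] sorted = Arrays.stream(values).distinct().sorted().toArray();
    int idx = Arrays.binarySearch(sorted, best);
    return idx > 0 ? (sorted[idx - 1] + best) / 2.0 : best;
  }
}
```

With values 1, 2, 3, 10, 11, 12 where only the last three are positive, bestThreshold returns 10 and midpointThreshold returns 6.5. Both partition the training data identically, but the midpoint leaves a margin on either side of the boundary for unseen values.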

      1. MAHOUT-943.patch (29 kB, Ikumasa Mukai)

          Activity

          Deneche A. Hakim added a comment -

          You can inherit from IgSplit and provide your own implementation, but we'll need a way to tell BuildForest which implementation to use.

          Ikumasa Mukai added a comment -

          Thank you for your advice.
          Yes, I have looked into how to implement this, and I agree that BuildForest should get an option for switching implementations.

          I will post a patch here once it is done!

          Wang Yue added a comment -

          Hi Mukai,
          Thanks for your efforts to extend the RF to regression. However, I have a doubt about your implementation in Regressionsplit.java. The variance method is:

          private static double variance(double[] s, double[] ss, double[] dataSize) {
            double var = 0;
            for (int i = 0; i < s.length; i++) {
              if (dataSize[i] > 0) {
                var += ss[i] - ((s[i] * s[i]) / dataSize[i]);
              }
            }
            return var;
          }

          In my mind, the variance should be something like:

          var += ss[i] / dataSize[i] - (s[i] * s[i]) / (dataSize[i] * dataSize[i]);

          Please correct me if I am wrong. Thanks.
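
To make the difference between the two expressions concrete, here is a small sketch. It assumes s is the sum of the values, ss is the sum of their squares, and n is the count, which is what the formulas above imply but the thread does not state. The class and method names are hypothetical. Under that reading, the patch's expression is the sum of squared deviations from the mean, and the suggested one is the population variance, so they differ by a factor of n. In Java the squared denominator must be parenthesised as (n * n), because `a / n * n` evaluates left to right.

```java
final class VarianceFormsSketch {

  /** Sum of squared deviations from the mean: ss - s^2 / n. */
  static double sumSquaredDeviations(double s, double ss, double n) {
    return ss - (s * s) / n;
  }

  /** Population variance: ss / n - (s / n)^2. */
  static double populationVariance(double s, double ss, double n) {
    return ss / n - (s * s) / (n * n);
  }
}
```

For the values 1, 2, 3, 4 (s = 10, ss = 30, n = 4) the first method gives 5.0 and the second gives 1.25, which is 5.0 / 4.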

          Ted Dunning added a comment -

          Also, that isn't a particularly good way to compute variance in the first place.

          Better to use Welford's method. Better still, use something like the OnlineSummarizer.
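
For readers unfamiliar with it, a minimal sketch of Welford's single-pass update follows. It is not Mahout's OnlineSummarizer or RunningAverageAndStdDev, and the class name WelfordSketch is hypothetical. The method updates the mean and the sum of squared deviations incrementally, which avoids the cancellation that can occur when subtracting two large, nearly equal quantities such as ss and s * s / n.

```java
final class WelfordSketch {
  private long n;       // number of values seen so far
  private double mean;  // running mean
  private double m2;    // running sum of squared deviations from the mean

  /** Incorporates one value using Welford's update. */
  void add(double x) {
    n++;
    double delta = x - mean;
    mean += delta / n;
    m2 += delta * (x - mean); // uses the already updated mean
  }

  long count() {
    return n;
  }

  double mean() {
    return mean;
  }

  /** Variance with divisor n; 0 when no values have been added. */
  double populationVariance() {
    return n > 0 ? m2 / n : 0.0;
  }

  /** Variance with divisor n - 1; 0 when fewer than two values have been added. */
  double sampleVariance() {
    return n > 1 ? m2 / (n - 1) : 0.0;
  }
}
```

Adding 1, 2, 3, 4 in turn yields a mean of 2.5 and a population variance of 1.25.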

          Sean Owen added a comment -

          Or use RunningAverageAndStdDev, which does this too.

          Ikumasa Mukai added a comment -

          Thank you for your comments.
          I will check the existing methods and post a fix as soon as possible.

          Ikumasa Mukai added a comment -

          I posted a patch for Regressionsplit.java on MAHOUT-945,
          because this issue (943) concerns the classification method.

          Ikumasa Mukai added a comment -

          I made a patch.

          Following Deneche-san's advice, I added a mechanism to configure the TreeBuilder through an XML file.

          <?xml version="1.0"?>
          <configuration>
            <treeBuilder class="org.apache.mahout.classifier.df.builder.DecisionTreeBuilder">
              <igSplit class="org.apache.mahout.classifier.df.split.ClassificationSplit"/>
              <m>5</m>
            </treeBuilder>
          </configuration>
          

          The ClassificationSplit class is a sample splitter that uses the average value as the split point.

          ./hadoop jar $MAHOUT_HOME/mahout-examples-0.6-SNAPSHOT-job.jar \
          org.apache.mahout.classifier.df.mapreduce.BuildForest \
          -Dmapred.max.split.size=1874231 \
          -d $KDD_DATA/KDDTrain.data \
          -ds $KDD_DATA/KDDTrain+.info \
          -c $MAHOUT_HOME/conf/df-config.xml \
          -p -t 100 -o $KDD_DATA/model
          

          I added a "-c" parameter to BuildForest. It should point to the XML configuration file.
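
As a rough sketch of how a file with the structure shown above could be read, the following uses only the JDK's DOM parser. It is not the patch's implementation, which is not shown in this thread. The class DfConfigSketch and its fields are hypothetical, and it assumes exactly one treeBuilder element containing one igSplit element and one m element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class DfConfigSketch {
  final String treeBuilderClass; // value of <treeBuilder class="...">
  final String igSplitClass;     // value of <igSplit class="...">
  final int m;                   // text content of <m>

  private DfConfigSketch(String treeBuilderClass, String igSplitClass, int m) {
    this.treeBuilderClass = treeBuilderClass;
    this.igSplitClass = igSplitClass;
    this.m = m;
  }

  /** Parses the configuration from an XML string; wraps any failure in an unchecked exception. */
  static DfConfigSketch parse(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      Element builder = (Element) doc.getElementsByTagName("treeBuilder").item(0);
      Element split = (Element) builder.getElementsByTagName("igSplit").item(0);
      String mText = builder.getElementsByTagName("m").item(0).getTextContent().trim();
      return new DfConfigSketch(
          builder.getAttribute("class"),
          split.getAttribute("class"),
          Integer.parseInt(mText));
    } catch (Exception e) {
      throw new IllegalArgumentException("Invalid DF configuration", e);
    }
  }
}
```

Parsing the example configuration this way would give m = 5 and the two class names as plain strings. Turning those names into a tree builder and a splitter is left out here.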

          Robin Anil added a comment -

          Deneche A. Hakim, can you see if this can go in, and/or resolve it appropriately?

          Suneel Marthi added a comment -

          Superseded by fix for Mahout-1419.


            People

            • Assignee:
              Deneche A. Hakim
              Reporter:
              Ikumasa Mukai
            • Votes:
              0
              Watchers:
              3
