Mahout / MAHOUT-943

Improve the way the split point is chosen in DF.

    Details

      Description

      The numericalSplit() method in OptIgSplit uses the attribute value with the best information gain (IG) as the split point.

      I think this is a little too strict. In some situations it would be better to use the average of the value with the best IG and the value with the second-best IG.
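As a rough illustration of the proposal above, the sketch below contrasts the current behaviour (take the value with the best IG) with the averaged variant. The class and method names are hypothetical and are not taken from the patch. It also assumes that "the 2nd value" means the candidate value with the second-best IG.

```java
/** Hypothetical helper, not part of Mahout: two ways to pick a numerical split point. */
final class SplitPointSketch {

  /** Index of the largest IG, skipping the index {@code skip} (pass -1 to skip nothing). */
  private static int argMax(double[] igs, int skip) {
    int best = -1;
    for (int i = 0; i < igs.length; i++) {
      if (i == skip) {
        continue;
      }
      if (best < 0 || igs[i] > igs[best]) {
        best = i;
      }
    }
    return best;
  }

  /** Current behaviour as described in this issue: the value with the best IG. */
  static double strictSplit(double[] values, double[] igs) {
    check(values, igs);
    return values[argMax(igs, -1)];
  }

  /** Proposed behaviour (assumed reading): mean of the best and second-best values. */
  static double averagedSplit(double[] values, double[] igs) {
    check(values, igs);
    int best = argMax(igs, -1);
    if (values.length == 1) {
      return values[best]; // nothing to average with
    }
    int second = argMax(igs, best);
    return (values[best] + values[second]) / 2.0;
  }

  private static void check(double[] values, double[] igs) {
    if (values.length == 0 || values.length != igs.length) {
      throw new IllegalArgumentException("values and igs must be non-empty and of equal length");
    }
  }
}
```

With candidate values {1, 2, 3, 4} and IGs {0.1, 0.5, 0.4, 0.2}, the strict variant splits at 2.0 and the averaged variant at 2.5.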

      1. MAHOUT-943.patch
        29 kB
        Ikumasa Mukai

        Issue Links

          Activity

          Suneel Marthi added a comment -

          Superseded by the fix for MAHOUT-1419.

          Robin Anil added a comment -

          Deneche A. Hakim, can you see whether this can go in, and/or resolve it appropriately?

          Ikumasa Mukai added a comment -

          I made a patch.

          Following Deneche-san's advice, I added a mechanism for configuring the TreeBuilder through an XML file:

          <?xml version="1.0"?>
          <configuration>
            <treeBuilder class="org.apache.mahout.classifier.df.builder.DecisionTreeBuilder">
              <igSplit class="org.apache.mahout.classifier.df.split.ClassificationSplit"/>
              <m>5</m>
            </treeBuilder>
          </configuration>
          

          ClassificationSplit is a sample splitter that uses the average value as the split point.

          ./hadoop jar $MAHOUT_HOME/mahout-examples-0.6-SNAPSHOT-job.jar \
          org.apache.mahout.classifier.df.mapreduce.BuildForest \
          -Dmapred.max.split.size=1874231 \
          -d $KDD_DATA/KDDTrain.data \
          -ds $KDD_DATA/KDDTrain+.info \
          -c $MAHOUT_HOME/conf/df-config.xml \
          -p -t 100 -o $KDD_DATA/model
          

          I added a "-c" parameter to BuildForest. It should point to the XML configuration file.

          Ikumasa Mukai added a comment -

          I posted a patch for RegressionSplit.java on MAHOUT-945,
          because this issue (943) covers only the classification method.

          Ikumasa Mukai added a comment -

          Thank you for your comments.
          I will check the existing methods and post a fix as soon as possible.

          Sean Owen added a comment -

          Alternatively, RunningAverageAndStdDev does this too.

          Ted Dunning added a comment -

          Also, that isn't a particularly good way to compute variance in the first place.

          It is better to use Welford's method or, better still, something like the OnlineSummarizer.
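For reference, a minimal sketch of Welford's method is shown below. It keeps a running mean and a running sum of squared deviations, which avoids the cancellation that can occur when subtracting two large, nearly equal sums. The class name is hypothetical, and this is not Mahout's OnlineSummarizer or RunningAverageAndStdDev.

```java
/** Hypothetical single-pass variance accumulator using Welford's update. */
final class WelfordVariance {
  private long n;      // number of values seen so far
  private double mean; // running mean
  private double m2;   // running sum of squared deviations from the mean

  /** Folds one value into the running statistics. */
  void add(double x) {
    n++;
    double delta = x - mean;   // deviation from the old mean
    mean += delta / n;         // update the mean
    m2 += delta * (x - mean);  // old deviation times new deviation
  }

  long count() {
    return n;
  }

  double mean() {
    return mean;
  }

  /** Population variance (divides by n); 0 when no values were added. */
  double populationVariance() {
    return n > 0 ? m2 / n : 0.0;
  }

  /** Sample variance (divides by n - 1); 0 when fewer than two values were added. */
  double sampleVariance() {
    return n > 1 ? m2 / (n - 1) : 0.0;
  }
}
```

Feeding in 2, 4, 4, 4, 5, 5, 7, 9 gives a mean of 5 and a population variance of 4.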

          Wang Yue added a comment -

          Hi Mukai,
          Thanks for your efforts in extending RF to regression. However, I have a doubt about your implementation in RegressionSplit.java. The variance method is:

          private static double variance(double[] s, double[] ss, double[] dataSize) {
            double var = 0;
            for (int i = 0; i < s.length; i++) {
              if (dataSize[i] > 0) {
                var += ss[i] - ((s[i] * s[i]) / dataSize[i]);
              }
            }
            return var;
          }

          To my mind, the variance should be something like:

          var += ss[i] / dataSize[i] - ((s[i] * s[i]) / (dataSize[i] * dataSize[i]));

          Please correct me if I am wrong. Thanks.
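A small check may help compare the two formulas. For one group with sum s, sum of squares ss and n data points, ss - s*s/n is the sum of squared deviations from the mean, which equals n times the population variance ss/n - (s/n)^2. The sketch below computes both quantities for a single group. The class and method names are hypothetical and this is not Mahout code. Note that in Java the n*n divisor needs explicit parentheses, because / and * have equal precedence and associate left to right.

```java
/** Hypothetical helper comparing the two per-group quantities discussed above. */
final class VarianceFormulas {

  /** Sum of squared deviations from the mean: ss - s^2 / n. */
  static double sumSquaredDeviations(double s, double ss, double n) {
    return ss - (s * s) / n;
  }

  /** Population variance: ss / n - s^2 / n^2. */
  static double populationVariance(double s, double ss, double n) {
    return ss / n - (s * s) / (n * n);
  }

  public static void main(String[] args) {
    // Data 1, 2, 3, 4: s = 10, ss = 30, n = 4.
    double s = 10, ss = 30, n = 4;
    System.out.println(sumSquaredDeviations(s, ss, n)); // sum of squared deviations
    System.out.println(populationVariance(s, ss, n));   // variance of the same data
  }
}
```

For the data 1, 2, 3, 4 the first quantity is 5 and the second is 1.25, so they differ exactly by the factor n = 4.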

          Ikumasa Mukai added a comment -

          Thank you for your advice.
          Yes, I have looked at how to implement this, and I agree that BuildForest should have an option for switching the implementation.

          I will post a patch here once it is done!

          Deneche A. Hakim added a comment -

          You can inherit from IgSplit and provide your own implementation. But we'll need to be able to tell BuildForest which implementation to use.
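One common way to make an implementation selectable is to pass its class name and instantiate it reflectively. The sketch below illustrates that idea only. SplitStrategy is a hypothetical stand-in and does not reproduce the real IgSplit API, and nothing here is taken from BuildForest.

```java
/** Hypothetical stand-in for a pluggable split strategy (not Mahout's IgSplit). */
abstract class SplitStrategy {
  /** Chooses a split point from the best and second-best candidate values. */
  abstract double splitPoint(double best, double secondBest);
}

/** Uses the best candidate as-is. */
final class StrictSplit extends SplitStrategy {
  @Override
  double splitPoint(double best, double secondBest) {
    return best;
  }
}

/** Uses the mean of the best and second-best candidates. */
final class AverageSplit extends SplitStrategy {
  @Override
  double splitPoint(double best, double secondBest) {
    return (best + secondBest) / 2.0;
  }
}

/** Instantiates a strategy from a class name, e.g. one read from an option or a config file. */
final class SplitStrategyLoader {
  static SplitStrategy load(String className) {
    try {
      Class<? extends SplitStrategy> cls =
          Class.forName(className).asSubclass(SplitStrategy.class);
      return cls.getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException | ClassCastException e) {
      throw new IllegalArgumentException("Cannot load split strategy: " + className, e);
    }
  }
}
```

A caller would then write something like `SplitStrategyLoader.load(AverageSplit.class.getName())`. A name that does not resolve to a SplitStrategy subclass is rejected with an IllegalArgumentException.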


            People

            • Assignee:
              Deneche A. Hakim
              Reporter:
              Ikumasa Mukai