Mahout / MAHOUT-840

Decision Forests should support Regression problems

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6
    • Component/s: Classification
    • Labels:
      None

      Description

      Improve Decision Forest code in order to handle numerical targets, thus supporting regression problems

      1. regression.patch
        21 kB
        Ikumasa Mukai
      2. regression.patch
        43 kB
        Ikumasa Mukai
      3. regression.patch
        47 kB
        Ikumasa Mukai
      4. MAHOUT-840-additional.patch
        14 kB
        Ikumasa Mukai
      5. MAHOUT-840.patch
        106 kB
        Ikumasa Mukai
      6. DecisionTreeBuilderTest.java
        1 kB
        Ikumasa Mukai

          Activity

          Deneche A. Hakim added a comment -

          DecisionForests use the Dataset class to get information about the attributes: whether they are numerical or categorical, and all possible values of categorical attributes. The target attribute is treated separately and is always assumed to be categorical. The first step should be to modify Dataset so that it treats the target attribute like any other attribute. Of course, the remaining code will still assume it is categorical.

          Hudson added a comment -

          Integrated in Mahout-Quality #1113 (See https://builds.apache.org/job/Mahout-Quality/1113/)
          MAHOUT-840 target attribute can now be numerical, although regression is still not supported

          adeneche : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1187953
          Files :

          • /mahout/trunk/core/src/main/java/org/apache/mahout/df/builder/DefaultTreeBuilder.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/df/data/Data.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/df/data/DataConverter.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/df/data/DataLoader.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/df/data/Dataset.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/df/data/Instance.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/df/mapreduce/Classifier.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/df/node/MockLeaf.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/df/node/Node.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/df/split/OptIgSplit.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/df/tools/Describe.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/df/tools/FrequenciesJob.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/df/tools/UDistrib.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/df/builder/InfiniteRecursionTest.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/df/data/DataConverterTest.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/df/data/DataLoaderTest.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/df/data/DataTest.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/df/data/DatasetTest.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/df/data/Utils.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/df/mapreduce/partial/Step1MapperTest.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/df/split/DefaultIgSplitTest.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/df/split/OptIgSplitTest.java
          • /mahout/trunk/examples/src/main/java/org/apache/mahout/df/BreimanExample.java
          • /mahout/trunk/examples/src/main/java/org/apache/mahout/df/mapreduce/TestForest.java
          Hudson added a comment -

          Integrated in Mahout-Quality #1116 (See https://builds.apache.org/job/Mahout-Quality/1116/)
          MAHOUT-840 Instance.id removed

          adeneche : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1188332
          Files :

          • /mahout/trunk/core/src/main/java/org/apache/mahout/df/data/Data.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/df/data/DataConverter.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/df/data/DataLoader.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/df/data/Instance.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/df/mapreduce/Classifier.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/df/mapreduce/partial/Step1Mapper.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/df/tools/FrequenciesJob.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/df/tools/UDistrib.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/df/data/DataConverterTest.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/df/data/DataLoaderTest.java
          • /mahout/trunk/examples/src/main/java/org/apache/mahout/df/mapreduce/TestForest.java
          Ikumasa Mukai added a comment -

          Hello,
          This is my first comment on JIRA.

          I am trying to implement a TreeBuilder that can also be used for regression problems, building on the MAHOUT-840 modifications.

          For this, I think some additional modifications are needed (patch attached).
          Could you please check this patch?

          The patch contains the following changes:
          1) Changed the target and prediction parameters from int to double.
          2) Added prediction for regression problems (using the majority rule if the target parameter is numeric).
          3) Added changes for using a numeric target parameter.
          4) Added changes for detecting the type of the target parameter.
          5) Added a function on Dataset that returns the number of categorical-type parameters.

          cheers

          Deneche A. Hakim added a comment -

          Thanks for the patch. I tried it but I am getting errors in the following tests:

          testBuild(org.apache.mahout.classifier.df.builder.InfiniteRecursionTest): org.apache.mahout.classifier.df.node.Leaf.<init>(I)V
          testProcessOutput(org.apache.mahout.classifier.df.mapreduce.partial.PartialBuilderTest): org.apache.mahout.classifier.df.node.Leaf.<init>(I)V
          testMapper(org.apache.mahout.classifier.df.mapreduce.partial.Step1MapperTest): org.apache.mahout.classifier.df.node.Leaf.<init>(I)V
          testReadTree(org.apache.mahout.classifier.df.node.NodeTest): org.apache.mahout.classifier.df.node.Leaf.<init>(I)V
          testReadLeaf(org.apache.mahout.classifier.df.node.NodeTest): org.apache.mahout.classifier.df.node.Leaf.<init>(I)V
          testParseNumerical(org.apache.mahout.classifier.df.node.NodeTest): org.apache.mahout.classifier.df.node.Leaf.<init>(I)V

          Are you getting those errors too?

          Ikumasa Mukai added a comment - - edited

          Thank you for checking, and sorry for the errors.

          I am looking into them, but I cannot reproduce them.
          These are the test results in my environment (mvn test):


          Running org.apache.mahout.classifier.df.builder.InfiniteRecursionTest
          Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
          Running org.apache.mahout.classifier.df.mapreduce.partial.PartialBuilderTest
          Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
          Running org.apache.mahout.classifier.df.mapreduce.partial.Step1MapperTest
          Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec
          Running org.apache.mahout.classifier.df.node.NodeTest
          Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec

          Results :
          Tests run: 40, Failures: 0, Errors: 0, Skipped: 0

          Could you please give me the details?
          Is it possible that an old test class, which still uses the int call, is being used?

          Regards,

          Deneche A. Hakim added a comment -

          Did you make changes to the tests? The patch you sent doesn't affect any test file. Try getting a new checkout of the trunk in another directory and applying the patch to it, then run the failing tests and see if they work fine.

          Lance Norskog added a comment -

          A couple of points:

          1. (int) on a double means Math.floor(double). Would it make more sense to round up or down?
          2. Leaf.hashCode() needs a matching equals().
            • The equals() should compare with an epsilon.
            • clamping to (int) is probably not a very good hash function

          Please add both a unit test and an example.
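
As an illustration of the two review points above, here is a minimal Java sketch. It is not code from the patch: the class name, the method names, and the EPSILON value are assumptions chosen for this example. It shows that a plain (int) cast truncates toward zero, so it matches Math.floor only for non-negative values, and it shows a tolerance-based comparison of two double labels.

```java
public class CastVsFloorSketch {

    // Hypothetical tolerance; a real Leaf class would choose its own value.
    static final double EPSILON = 1.0e-6;

    /** Plain narrowing cast: drops the fractional part (rounds toward zero). */
    static int truncate(double value) {
        return (int) value;
    }

    /** Floor first, then cast: always rounds toward negative infinity. */
    static int floorToInt(double value) {
        return (int) Math.floor(value);
    }

    /** Compares two labels with a tolerance instead of using ==. */
    static boolean nearlyEqual(double a, double b) {
        return Math.abs(a - b) < EPSILON;
    }

    public static void main(String[] args) {
        for (double v : new double[] {2.7, -2.7}) {
            System.out.println(v + ": cast=" + truncate(v)
                + " floor=" + floorToInt(v)
                + " round=" + Math.round(v));
        }
        // 0.1 + 0.2 is not exactly 0.3 in binary floating point.
        System.out.println("== gives " + (0.1 + 0.2 == 0.3)
            + ", nearlyEqual gives " + nearlyEqual(0.1 + 0.2, 0.3));
    }
}
```

One caveat on design: a tolerance-based equals() is not transitive, so no hashCode() can be made strictly consistent with it. The sketch therefore only shows the comparison itself, not a full equals()/hashCode() pair.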

          Ikumasa Mukai added a comment -

          Hi Norskog-san.
          Thank you very much for reviewing and advising.

          I am working on it now and will attach the result here when it is done.

          Regarding the example, could you please tell me what you have in mind?

          Regards,

          Ikumasa Mukai added a comment -

          Hi.
          I made a new patch!

          > 1. (int) on a double means Math.floor(double). Would it make more sense to round up or down?

          I think the (int) cast is fine, because most of the places that cast from double to int do so to get the index of a categorical attribute. To make this clear, I added comments at each of those places.

          Could you please tell me what you think about this?

          > 2. Leaf.hashCode() needs a matching equals().
          > ・The equals() should compare with an epsilon.
          > ・clamping to (int) is probably not a very good hash function

          I fixed these the way you advised!

          > 3. Please add both a unit test and an example.

          I added them!

          Regards,

          Deneche A. Hakim added a comment -

          Hi Ikumasa,

          I am reviewing the patch right now and I wanted to say: good job!

          I noticed that both DataLoader.parseString() and DataConverter.convert() don't take into account that the dataset is in "regression" mode. They treat the label as being categorical !

          Did you try to run BuildForest and TestForest on a regression dataset ?

          Ikumasa Mukai added a comment -

          Hi Hakim-san.
          Thank you for reviewing!

          > DataLoader.parseString() and DataConverter.convert() don't take into account that the dataset is in "regression" mode.

          I attached new patch which has the additional modifications for DataLoader.parseString().

          For DataConverter.convert(),
          my old patch has the modification, so no change is made on this.

          > Did you try to run BuildForest and TestForest on a regression dataset ?

          Yes, I have an own TreeBuilder and ResultAnalyzer for regression dataset.

          But I feel it is better that the patch for MAHOUT-840 doesn't have them.
          Is it good?

          Regards,

          Deneche A. Hakim added a comment -

          Hi Ikumasa

          > I attached new patch which has the additional modifications for DataLoader.parseString().

          Cool, thanks.

          > For DataConverter.convert(),
          > my old patch has the modification, so no change is made on this.

          Ah, I thought your last patch included all the modifications. So should I apply all three patches one after another ?

          > Yes, I have an own TreeBuilder and ResultAnalyzer for regression dataset.
          > But I feel it is better that the patch for MAHOUT-840 doesn't have them.
          > Is it good?

          If you can add them to the patch as well that would be awesome !!!

          Deneche A. Hakim added a comment -

          > Ah, I thought your last patch included all the modifications. So should I apply all three patches one after
          > another ?

          Ok, the patch already included the modifications to DataConverter. I will review it tomorrow. Hopefully you will include TreeBuilder and ResultAnalyzer.

          Ikumasa Mukai added a comment -

          Hi Hakim-san.
          Thank you for checking, and I'm sorry that my poor English confused you.

          I am delighted that our code (TreeBuilder and ResultAnalyzer) can be shared!

          But may I send it by next week?
          Its quality is not very good right now, because I had planned to share it as a later step, and I would like to refactor it first.

          So it would be great if we could finalize the latest regression.patch in the meantime.

          Regards & Thanks,

          Deneche A. Hakim added a comment -

          No problem Ikumasa, hopefully I will finish reviewing the current patch and get it committed by then.

          Ikumasa Mukai added a comment -

          Hi Hakim-san.

          Sorry for the delay!
          I have added a new patch (MAHOUT-840.patch) which contains the new TreeBuilder and more.

          The additions are:

          1) Added DecisionTreeBuilder

          This class can build both classification and regression trees.
          When building a regression tree, it uses the variance.

          The class also has functions for filling in missing leaves and for preventing overfitting, for both kinds of tree.

          To fill in a missing leaf, the other leaves of the parent stem are used.
          To prevent overfitting, the number of data points on the leaf is used. (For regression trees, the value of the variance is also checked.)
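
To illustrate what a variance-based split with a minimum leaf size can look like, here is a minimal Java sketch. It is not the DecisionTreeBuilder code from the patch: the class name, method names, and the exact criterion (minimizing the summed squared deviation of the two children) are assumptions made for this example.

```java
public class VarianceSplitSketch {

    /** Sum of squared deviations from the mean for targets[from, to). */
    static double sumSquaredDev(double[] targets, int from, int to) {
        double mean = 0.0;
        for (int i = from; i < to; i++) {
            mean += targets[i];
        }
        mean /= (to - from);
        double ss = 0.0;
        for (int i = from; i < to; i++) {
            double d = targets[i] - mean;
            ss += d * d;
        }
        return ss;
    }

    /**
     * Finds the threshold on one numeric attribute that most reduces the
     * squared deviation of the target. xs must be sorted ascending and ys
     * aligned with it. Returns NaN when no split improves on the unsplit node
     * or when a child would hold fewer than minLeafSize instances.
     */
    static double bestThreshold(double[] xs, double[] ys, int minLeafSize) {
        int n = xs.length;
        int min = Math.max(1, minLeafSize);
        if (n < 2 * min) {
            return Double.NaN;
        }
        double best = sumSquaredDev(ys, 0, n); // cost of not splitting
        double threshold = Double.NaN;
        for (int i = min; i <= n - min; i++) {
            if (xs[i] == xs[i - 1]) {
                continue; // cannot separate equal attribute values
            }
            double cost = sumSquaredDev(ys, 0, i) + sumSquaredDev(ys, i, n);
            if (cost < best) {
                best = cost;
                threshold = (xs[i - 1] + xs[i]) / 2.0;
            }
        }
        return threshold;
    }

    public static void main(String[] args) {
        double[] xs = {1, 2, 3, 10, 11, 12};
        double[] ys = {1, 1, 1, 5, 5, 5};
        System.out.println("threshold = " + bestThreshold(xs, ys, 1));
    }
}
```

A node whose targets are already constant has zero variance, so the sketch returns NaN there and the node would become a leaf. This mirrors the idea of checking both the leaf size and the variance before splitting further.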

          2) Added RegressionResultAnalyzer

          This class shows the result like this.

          =======================================================
          Summary
          -------------------------------------------------------
          Correlation coefficient                 :     1.0076
          Mean absolute error                     :     1.8083
          Root mean squared error                 :     2.5944
          Total Regressed Instances               :         50
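
For readers unfamiliar with the three figures in this summary, the following Java sketch computes them from their textbook definitions. It is not the RegressionResultAnalyzer code: the class and method names are assumptions, and the thread does not show how the analyzer itself computes its values. The sketch uses the standard Pearson correlation coefficient, which by definition lies between -1 and 1.

```java
public class RegressionMetricsSketch {

    /** Mean of |actual - predicted|. */
    static double meanAbsoluteError(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - predicted[i]);
        }
        return sum / actual.length;
    }

    /** Square root of the mean of (actual - predicted)^2. */
    static double rootMeanSquaredError(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double d = actual[i] - predicted[i];
            sum += d * d;
        }
        return Math.sqrt(sum / actual.length);
    }

    /** Pearson correlation between actual and predicted values. */
    static double correlation(double[] actual, double[] predicted) {
        int n = actual.length;
        double meanA = 0.0;
        double meanP = 0.0;
        for (int i = 0; i < n; i++) {
            meanA += actual[i];
            meanP += predicted[i];
        }
        meanA /= n;
        meanP /= n;
        double cov = 0.0;
        double varA = 0.0;
        double varP = 0.0;
        for (int i = 0; i < n; i++) {
            double da = actual[i] - meanA;
            double dp = predicted[i] - meanP;
            cov += da * dp;
            varA += da * da;
            varP += dp * dp;
        }
        return cov / Math.sqrt(varA * varP);
    }

    public static void main(String[] args) {
        double[] actual = {1, 2, 3};
        double[] predicted = {2, 4, 6};
        System.out.println("Correlation coefficient : " + correlation(actual, predicted));
        System.out.println("Mean absolute error     : " + meanAbsoluteError(actual, predicted));
        System.out.println("Root mean squared error : " + rootMeanSquaredError(actual, predicted));
    }
}
```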
          

          3) How to use:
          I added a "-b" parameter to BuildForest for selecting the TreeBuilder class.

           org.apache.mahout.df.mapreduce.BuildForest \
          -Dmapred.max.split.size=1874231 \
          -oob \
          -d $KDD_DATA/KDDTrain+.arff \
          -ds $KDD_DATA/KDDTrain+.info \
          -sl 5 \
          -p \
          -t 100 \
          -b org.apache.mahout.classifier.df.builder.DecisionTreeBuilder \
          -o $KDD_DATA/model
          

          For both classification and regression, I tested this patch visually using DecisionTreeBuilderTest.java.
          This class uses the TreePrinter and the ArffDataLoader.

          The TreePrinter can be used to make the model data visible, like this:

          i. iris - classification

          petallength < 3.3 : Iris-setosa
          petallength >= 3.3
          |   petalwidth < 1.8
          |   |   petallength < 5
          |   |   |   petalwidth < 1.7 : Iris-versicolor
          |   |   |   petalwidth >= 1.7 : Iris-virginica
          |   |   petallength >= 5
          |   |   |   petalwidth < 1.6 : Iris-virginica
          |   |   |   petalwidth >= 1.6
          |   |   |   |   sepallength < 7.2 : Iris-versicolor
          |   |   |   |   sepallength >= 7.2 : Iris-virginica
          |   petalwidth >= 1.8
          |   |   petallength < 4.9
          |   |   |   sepallength < 6 : Iris-versicolor
          |   |   |   sepallength >= 6 : Iris-virginica
          |   |   petallength >= 4.9 : Iris-virginica
          

          ii. cars - regression

          speed < 30
          |   speed < 12
          |   |   speed < 3 : 4
          |   |   speed >= 3
          |   |   |   speed < 7 : 7
          |   |   |   speed >= 7 : 6.5
          |   speed >= 12
          |   |   speed < 23
          |   |   |   speed < 21
          |   |   |   |   speed < 19
          |   |   |   |   |   speed < 15 : 12
          |   |   |   |   |   speed >= 15
          |   |   |   |   |   |   speed < 16.5 : 8
          |   |   |   |   |   |   speed >= 16.5
          |   |   |   |   |   |   |   speed < 17.5 : 11
          |   |   |   |   |   |   |   speed >= 17.5 : 10
          |   |   |   |   speed >= 19 : 13.5
          |   |   |   speed >= 21 : 7
          |   |   speed >= 23
          |   |   |   speed < 27
          |   |   |   |   speed < 25 : 12
          |   |   |   |   speed >= 25 : 13
          |   |   |   speed >= 27 : 11.5
          speed >= 30
          |   speed < 84.5
          ---snip---
          

          The ArffDataLoader can read data files in ARFF format, which makes testing easier.

          These two additions are contained in the last regression.patch.

          Regards,

          Deneche A. Hakim added a comment -

          Wow, thank you for your contribution. I will make sure to review the patch as soon as possible.

          Deneche A. Hakim added a comment -

          Ok, I applied the patch on the trunk and it went ok. All the tests are fine. I will run some classification examples first and see if they are still ok.

          Ikumasa Mukai added a comment -

          Thank you very much for checking.

          Please let me know if you have any questions, because the patch contains several functions and my English explanation isn't very good.

          Regards,

          Deneche A. Hakim added a comment -

          The examples all ran fine. I just committed the patch after making the following changes:

          • Removed TreePrinter, ForestPrinter, ArffDataLoader, ArffData, and ArffInvalidFormatException. These are really helpful additions, but they aren't needed to support regression, so they should probably be submitted in a new JIRA issue (go ahead and create one).
          • Made some style modifications using IntelliJ code inspection.

          I do have one question: I see that DecisionTreeBuilder supports both classification and regression trees. Should we remove DefaultTreeBuilder and just use DecisionTreeBuilder instead?

          Hudson added a comment -

          Integrated in Mahout-Quality #1244 (See https://builds.apache.org/job/Mahout-Quality/1244/)
          MAHOUT-840 Build and Test regression forests

          adeneche : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1213034
          Files :

          • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/recommender/SamplingCandidateItemsStrategy.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/RegressionResultAnalyzer.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/DFUtils.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/DecisionForest.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/ErrorEstimate.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/builder/DecisionTreeBuilder.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/builder/DefaultTreeBuilder.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/data/Data.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/data/DataConverter.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/data/DataLoader.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/data/DataUtils.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/data/Dataset.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/data/DescriptorUtils.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/data/Instance.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/mapreduce/Builder.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/mapreduce/Classifier.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/mapreduce/MapredOutput.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/mapreduce/inmem/InMemBuilder.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/mapreduce/inmem/InMemMapper.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/mapreduce/partial/PartialBuilder.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/mapreduce/partial/Step1Mapper.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/node/CategoricalNode.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/node/Leaf.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/node/Node.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/node/NumericalNode.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/split/DefaultIgSplit.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/split/IgSplit.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/split/OptIgSplit.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/split/RegressionSplit.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/split/Split.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/tools/Frequencies.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/tools/FrequenciesJob.java
          • /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/tools/UDistrib.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/classifier/df/data/DataConverterTest.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/classifier/df/data/DataLoaderTest.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/classifier/df/data/DataTest.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/classifier/df/data/DatasetTest.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/classifier/df/mapreduce/partial/PartialSequentialBuilder.java
          • /mahout/trunk/core/src/test/java/org/apache/mahout/classifier/df/node/NodeTest.java
          • /mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/df/BreimanExample.java
          • /mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/df/mapreduce/BuildForest.java
          • /mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/df/mapreduce/TestForest.java
          Ikumasa Mukai added a comment -

          Thank you so much for adopting my patch!
          I am glad I can contribute to your project.

          > DecisionTreeBuilder supports both classification and regression tree.
          > Should we remove DefaultTreeBuilder and just use DecisionTreeBuilder instead ?

          Yes, I think so, because we get the same result if we use DecisionTreeBuilder with the pruning and complementing functions disabled.
          (I wrote DecisionTreeBuilder with reference to DefaultTreeBuilder.)

          So I will make an additional patch that allows the parameters (minSplitNum, complemented, etc.) of DecisionTreeBuilder to be set from outside (the command line), and will attach it to this JIRA issue.

          Regards,

          Ikumasa Mukai added a comment -

          Hi Hakim-san.

          I made a patch that makes DecisionTreeBuilder the default tree builder.

          Here is the help output:

          Usage:                                                                          
           [--data <path> --dataset <dataset> --selection <m> --no-complete --minsplit    
          <minsplit> --minprop <minprop> --seed <seed> --partial --nbtrees <nbtrees>      
          --output <path> --help]                                                         
          Options                                                                         
            --data (-d) path             Data path                                        
            --dataset (-ds) dataset      Dataset path                                     
            --selection (-sl) m          Optional, Number of variables to select randomly 
                                         at each tree-node.                               
                                         For classification problem, the default is       
                                         square root of the number of explanatory         
                                         variables.                                       
                                         For regression problem, the default is 1/3 of    
                                         the number of explanatory variables.             
            --no-complete (-nc)          Optional, The tree is not complemented           
            --minsplit (-ms) minsplit    Optional, The tree-node is not divided, if the   
                                         branching data size is smaller than this value.  
                                         The default is 2.                                
            --minprop (-mp) minprop      Optional, The tree-node is not divided, if the   
                                         proportion of the variance of branching data is  
                                         smaller than this value.                         
                                         The default is 1/1000(0.001).                    
            --seed (-sd) seed            Optional, seed value used to initialise the      
                                         Random number generator                          
            --partial (-p)               Optional, use the Partial Data implementation    
            --nbtrees (-t) nbtrees       Number of trees to grow                          
            --output (-o) path           Output path, will contain the Decision Forest    
            --help (-h)                  Print out help
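
A minimal Java sketch of the two stopping rules described by --minsplit and --minprop in the help text above. This is not the code from the patch: the class name SplitStopRule, the method names, and the reading of "proportion of the variance" as node variance relative to the variance of the full training data are all assumptions made for illustration.

```java
// Hypothetical illustration of the --minsplit / --minprop stopping rules.
// Not taken from the MAHOUT-840 patch; names and interpretation are assumptions.
final class SplitStopRule {
  private final int minSplitNum;              // corresponds to --minsplit
  private final double minVarianceProportion; // corresponds to --minprop

  SplitStopRule(int minSplitNum, double minVarianceProportion) {
    this.minSplitNum = minSplitNum;
    this.minVarianceProportion = minVarianceProportion;
  }

  // Defaults stated in the help text: minsplit = 2, minprop = 0.001.
  static SplitStopRule defaults() {
    return new SplitStopRule(2, 0.001);
  }

  // Returns true when the node should become a leaf instead of being divided.
  // nodeVariance: variance of the target over the data reaching this node.
  // fullVariance: variance of the target over the whole training set (assumed baseline).
  boolean shouldStop(int nodeSize, double nodeVariance, double fullVariance) {
    if (nodeSize < minSplitNum) {
      return true; // too few instances to split
    }
    return nodeVariance < fullVariance * minVarianceProportion;
  }

  public static void main(String[] args) {
    SplitStopRule rule = SplitStopRule.defaults();
    System.out.println(rule.shouldStop(1, 10.0, 10.0));   // fewer than 2 instances
    System.out.println(rule.shouldStop(50, 0.005, 10.0)); // variance below 0.1% of total
    System.out.println(rule.shouldStop(50, 5.0, 10.0));   // still worth splitting
  }
}
```

Under this reading, raising either value yields shallower trees, since more nodes are turned into leaves early.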
          

          In summary, I added these 3 options

          --no-complete (-nc)
          --minsplit (-ms) minsplit
          --minprop (-mp) minprop
          

          and changed the "--selection (-sl)" option from required to optional, because I think an appropriate value can be calculated.
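
As a rough illustration of how a default for --selection could be derived, the Java sketch below follows the rule in the help text: the square root of the number of explanatory variables for classification, and one third of that number for regression. The class and method names are hypothetical, and rounding up with a minimum of 1 is an assumption, since the help text does not say how fractional values are handled.

```java
// Hypothetical helper, not taken from the patch: computes a default value of m
// (variables tried at each tree node) when --selection is not given.
final class DefaultSelection {
  private DefaultSelection() {}

  // nbAttributes: number of explanatory variables (target excluded).
  // regression: true for a numerical target, false for a categorical one.
  static int defaultM(int nbAttributes, boolean regression) {
    if (nbAttributes < 1) {
      throw new IllegalArgumentException("need at least one explanatory variable");
    }
    if (regression) {
      // 1/3 of the variables; rounding up and the floor of 1 are assumptions
      return Math.max(1, (int) Math.ceil(nbAttributes / 3.0));
    }
    // square root of the number of variables; rounding up is an assumption
    return (int) Math.ceil(Math.sqrt(nbAttributes));
  }

  public static void main(String[] args) {
    System.out.println(defaultM(4, false)); // classification with 4 variables
    System.out.println(defaultM(13, true)); // regression with 13 variables
  }
}
```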

          Would you please check my patch?
          (This patch doesn't include the removal of DefaultTreeBuilder.)

          Regards,

          Deneche A. Hakim added a comment -

          I committed the patch, thanks again Ikumasa for this great work.

          Hudson added a comment -

          Integrated in Mahout-Quality #1267 (See https://builds.apache.org/job/Mahout-Quality/1267/)
          MAHOUT-840 DecisionTreeBuilder Test
          MAHOUT-840 DecisionTreeBuilder is now the default tree builder

          adeneche : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1222595
          Files :

          • /mahout/trunk/core/src/test/java/org/apache/mahout/classifier/df/builder/DecisionTreeBuilderTest.java

          adeneche : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1222594
          Files :

          • /mahout/trunk/core/src/test/java/org/apache/mahout/classifier/df/builder/InfiniteRecursionTest.java
          • /mahout/trunk/examples/src/main/java/org/apache/mahout/classifier/df/mapreduce/BuildForest.java
          Deneche A. Hakim added a comment -

          The issue is solved. Let's wait to make sure the changes don't have any side effects


            People

            • Assignee:
              Deneche A. Hakim
              Reporter:
              Deneche A. Hakim