Mahout / MAHOUT-122

Random Forests Reference Implementation

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.2
    • Fix Version/s: 0.2
    • Component/s: Classification
    • Labels: None

      Description

      This is the first step of my GSOC project: implement a simple, easy-to-understand reference implementation of Random Forests (building and classification). The only requirement here is that "it works".

      Attachments

      1. refimp_Sep27.patch (173 kB) - Deneche A. Hakim
      2. refimp_Sep_15.patch (173 kB) - Deneche A. Hakim
      3. refimp_Jul7.diff (160 kB) - Deneche A. Hakim
      4. refimp_Jul6.diff (158 kB) - Deneche A. Hakim
      5. 3w_patch.diff (120 kB) - Deneche A. Hakim
      6. 2w_patch.diff (89 kB) - Deneche A. Hakim
      7. RF reference.patch (41 kB) - Deneche A. Hakim


          Activity

          Deneche A. Hakim added a comment -

          This is a basic reference implementation that builds a forest and estimates the out-of-bag error along the way. I used Mahout's Vector to represent the data instances.
          There are no tests at all; I could add them, but it would probably take another week, so are they worth it (I mean for a reference implementation)?
          Here is a short description of the files:

          • DataUtils.java and Condition.java: contain helper methods that deal with Vector(s) and arrays of values. I don't think they'll play an important role in the distributed version (they need access to all the data at some point)
          • Dataset.java: describes the attributes of the data: how many there are, which ones are NUMERICAL and which ones are NOMINAL. It is also sometimes useful to be able to IGNORE some attributes (e.g. ID attributes)
          • DataLoader.java: a simple class that converts an array of comma-separated strings (this is the basic format of the UCI datasets) to a Vector List.
            It is important to note that, to be able to store the NOMINAL attributes, which generally are of type String, this method needs to parse all the data in order to convert those values into integers.
            To keep things simple, IGNORED attributes are replaced with Double.NaN, but a more efficient implementation would skip them.
            For now this DataLoader does not know how to deal with missing attributes ('?'); an alternative would be to skip the data lines that contain missing attributes
          • PostOpData.java: contains the data of the Post Operative Dataset from UCI
          • InformationGain.java: entropy and Information Gain calculations (see the sketch at the end of this comment). Because those calculations need access to all the data for a given tree node, these methods should probably be rewritten in a map-reduce form
          • DecisionTree.java: contains a simple implementation of the LearnUnprunedTree algorithm described in the Wiki. A special case that was not described in the algorithm is when the data becomes empty (which happens from time to time); in this case (and based on the Weka implementation of Random Forests) the method returns a Leaf that predicts -1 (which means it cannot predict the label)
            For now the output of the classification is the predicted label, or -1 if no label could be predicted. A better solution would be to return the label's probability distribution
          • Node.java: represents a tree node, two implementations are available for NOMINAL and NUMERICAL attributes
          • ErrorEstimate.java: computes the errors for each distinct label and the global error rate, using the confusion matrix generated by the building algorithm
          • RandomForest.java: uses DecisionTree to build each tree of the forest. It contains a simple example (in the main method); after execution the program outputs the error rate for each label followed by the total error rate

          I started to think about the distributed implementation, but I could (if necessary) spend some time writing the tests for the reference implementation.
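
          For readers who want a concrete picture of the entropy and Information Gain computation mentioned above, here is a minimal, self-contained sketch over integer labels (an illustration only, not the actual InformationGain.java code):

          import java.util.HashMap;
          import java.util.Map;

          // Illustrative entropy / information gain over integer class labels.
          class IgSketch {

            // H(labels) = - sum over classes c of p(c) * log2(p(c))
            static double entropy(int[] labels) {
              Map<Integer, Integer> counts = new HashMap<Integer, Integer>();
              for (int label : labels) {
                Integer c = counts.get(label);
                counts.put(label, c == null ? 1 : c + 1);
              }
              double h = 0.0;
              for (int count : counts.values()) {
                double p = (double) count / labels.length;
                h -= p * Math.log(p) / Math.log(2.0);
              }
              return h;
            }

            // IG = H(parent) - sum over children of |child| / |parent| * H(child)
            static double informationGain(int[] parent, int[][] children) {
              double ig = entropy(parent);
              for (int[] child : children) {
                ig -= (double) child.length / parent.length * entropy(child);
              }
              return ig;
            }
          }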

          Otis Gospodnetic added a comment -

          I can't comment on RF impl, but did notice that applying the patch issued warnings about CRs being stripped. I guess that's a good thing...

          Deneche A. Hakim added a comment -

          I can't comment on RF impl, but did notice that applying the patch issued warnings about CRs being stripped. I guess that's a good thing...

          That's strange, I applied the patch to a fresh checkout of Mahout and launched mvn install, and all went ok!
          Could it be because I am using TortoiseSVN to create/apply the patches (although it has worked well in the past)?

          Deneche A. Hakim added a comment -

          second week patch
          work in progress...

          changes:

          • added many tests, although some are still missing
          • added a new class "Instance" that allows me to add an ID and a separate LABEL to a Vector
          • DataLoader.loadData(String, FileSystem, Path) loads the data from a file, IGNORED attributes are skipped
          • Dataset handles only NUMERICAL and CATEGORICAL attributes
            • contains a List<String> that represents the labels as found in the data, before being converted to int
          • added a new class "Data" that represents the data being loaded
            • contains methods to create subsets from the current Data
            • the only way to get a new Data instance is to load it with DataLoader, or to use methods from an existing Data instance
            • this class could prove useful later to optimize the memory usage of the data
          • ForestBuilder.buildForest uses a PredictionCallback to collect the oob predictions; by changing the callback we can compute different error rates (see the sketch after this list), for example:
            • Forest out-of-bag error estimation
            • mean tree error rate
            • ...
          • I added a small running example in ForestBuilder.main(); this example shows a typical use of Random Forests:
            • loads the data from a file; you'll need to provide a descriptor. For example, UciDescriptors.java contains the descriptors for the "glass" and "post-operative" UCI datasets (the datasets are available at the UCI web site)
            • reserves 10% of the data as a test set (not used for now)
            • builds a random forest using the remaining data
            • computes the oob error estimation
            • this procedure is repeated 100 times and the mean oob error estimation is printed
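
          As a rough illustration of the PredictionCallback idea from the list above (the interface below is a simplified assumption to keep the sketch self-contained; the actual patch passes the Instance itself to the callback):

          // Hypothetical callback signature, simplified for this sketch.
          interface PredictionCallbackSketch {
            void prediction(int treeId, int actualLabel, int predictedLabel);
          }

          // Counts how often out-of-bag predictions miss the true label
          // (a simple error-rate collector; a full oob estimate would aggregate votes per instance).
          class OobErrorCollector implements PredictionCallbackSketch {
            private int total;
            private int errors;

            public void prediction(int treeId, int actualLabel, int predictedLabel) {
              total++;
              if (predictedLabel != actualLabel) {
                errors++;
              }
            }

            double errorRate() {
              return total == 0 ? Double.NaN : (double) errors / total;
            }
          }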

          If you want to try the example, you'll need to download the "post-operative" or the "glass" dataset from UCI, put it somewhere, change the first line of ForestBuilder.main() to the correct path, and use the corresponding UciDescriptor in the third line.

          Note about memory usage:

          • the reference implementation loads the data in-memory, then builds the trees one at a time
          • each tree is built recursively using DecisionTree.learnUnprunedTree(), at each node the data is split and learnUnprunedTree() is called for each subset
          • the current implementation of "Data" is not memory efficient: each subset keeps its own copy of its part of the data; thus, except for Leaf nodes, each level of the tree generates one more copy of the data in memory

          What's next:

          • RandomForest class that will contain the result of the forest building and can be stored/loaded from a file
          • try the implementation on the same UCI datasets as in Breiman's paper, using the same complete procedure
          • do some memory usage monitoring
          Deneche A. Hakim added a comment -

          I've been reading Breiman's paper about Random Forests (available here), and on page 9 he says:

          "Grow the tree using CART methodology to maximum size and do not prune."

          So apparently he uses the CART algorithm to grow the trees, and if I'm not wrong, it differs from the algorithm that I described in the wiki http://cwiki.apache.org/MAHOUT/random-forests.html. The most important difference is the way it splits CATEGORICAL attributes:

          • in the algorithm that I'm using, a node is built for each value of the attribute
          • in CART a best split value is found (in a similar way to NUMERICAL attributes) and only two nodes are built, depending on whether the attribute's value is equal to the split value or not

          I think that the best thing to do is to create an abstract DecisionTreeBuilder class; this way we can use whatever implementation we want.
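
          A rough sketch of what such an abstraction could look like (all names below are placeholders, not the actual patch classes):

          import java.util.Random;

          // The forest builder would depend only on this abstract class, so a wiki-style
          // multi-way splitter or a CART-style binary splitter can be swapped in freely.
          abstract class AbstractDecisionTreeBuilder {

            // Minimal stand-in types so the sketch is self-contained.
            interface Data { }
            interface Node { int classify(double[] instance); }

            // Build one unpruned decision tree from the (bagged) training data.
            abstract Node build(Random rng, Data data);
          }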

          Deneche A. Hakim added a comment - edited

          What's new:
          I added the results for KDD 10% with 10 trees. I also tried to build one single tree with KDD 50%, and after more than 12 hours (!) of computing I gave up.

          I was wrong about the memory usage of the current implementation: even though each node has its own Data object, all the Data objects still share the same Instance objects, which hold all the actual data.

          I did some profiling and found that the "InformationGain.computeSplit()" method, which is responsible for computing the Information Gain for the current split, takes nearly 98.5% of the total time. So if we want to optimize this implementation later we'll have to use a better algorithm to compute the Information Gain; the one I'm aware of, which is available in the Weka source code, computes the sorting indices of the data for each attribute.
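
          For what it's worth, here is a rough sketch of that kind of optimization for a single numerical attribute (an illustration only, not the Weka or Mahout code): sort the instances by the attribute once, then sweep the candidate split points while maintaining running label counts, instead of re-partitioning the data for every candidate split.

          // Illustration only: assumes the instances are already sorted by the attribute,
          // so values[] is non-decreasing and labels[i] is the label of the i-th sorted instance.
          class SortedSplitSketch {

            // Scans all candidate split points once, keeping running label counts, and returns
            // the split value with the lowest size-weighted child entropy (equivalently the
            // highest information gain, since the parent entropy is the same for all candidates).
            static double bestNumericalSplit(double[] values, int[] labels, int numLabels) {
              int[] leftCounts = new int[numLabels];
              int[] rightCounts = new int[numLabels];
              for (int label : labels) {
                rightCounts[label]++;
              }

              double bestSplit = Double.NaN;
              double bestScore = Double.POSITIVE_INFINITY;

              for (int i = 0; i < values.length - 1; i++) {
                leftCounts[labels[i]]++;   // move instance i from the right partition to the left
                rightCounts[labels[i]]--;
                if (values[i] == values[i + 1]) {
                  continue;                // can only split between two distinct values
                }
                int leftSize = i + 1;
                int rightSize = values.length - leftSize;
                double score = leftSize * entropy(leftCounts, leftSize)
                             + rightSize * entropy(rightCounts, rightSize);
                if (score < bestScore) {
                  bestScore = score;
                  bestSplit = (values[i] + values[i + 1]) / 2.0;
                }
              }
              return bestSplit;
            }

            private static double entropy(int[] counts, int total) {
              double h = 0.0;
              for (int count : counts) {
                if (count > 0) {
                  double p = (double) count / total;
                  h -= p * Math.log(p) / Math.log(2.0);
                }
              }
              return h;
            }
          }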

          I also did some memory usage profiling, using a Runnable that samples a rough estimate of memory usage every 50 ms via (Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()); a sketch of such a sampler is given at the end of this comment. I used the KDD dataset (> 700 Mb of data), then created different datasets using subsets of different sizes (1%, 10%, 25%, 50%). Here are the results:

          KDD has 41 attributes (stored as "double")
          KDD 1% has 49402 instances
          KDD 10% has 494021 instances
          KDD 25% has 1224607 instances

          Dataset | Nb Trees | MUALD         | Max Used Memory | Nb Nodes   | Max Tree Depth
          KDD 1%  | 1        | 35.414.504 b  | 38.069.640 b    | 120        | 10
          KDD 1%  | 10       | 35.144.096 b  | 45.669.904 b    | 126 (mean) | 11 (mean)
          KDD 10% | 1        | 201.697.512 b | 226.653.392 b   | 712        | 22
          KDD 10% | 10       | 201.697.512 b | 276.780.280 b   | 870 (mean) | 29 (mean)
          KDD 25% | 1        | 521.515.136 b | 569.795.152 b   | 930        | 26

          MUALD: memory used right after loading the Data

          I should run more tests using KDD 50% and KDD 100%, and also build more trees, to see how the memory usage behaves. But because the current implementation is very slow, this may take some time.
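
          For reference, a minimal sketch of the kind of sampling Runnable described above (an illustration, not the exact code used for these measurements):

          // Rough memory sampler: polls the used heap size every 50 ms and remembers the maximum.
          class MemorySampler implements Runnable {
            private volatile boolean running = true;
            private volatile long maxUsed = 0;

            public void run() {
              Runtime rt = Runtime.getRuntime();
              while (running) {
                long used = rt.totalMemory() - rt.freeMemory(); // rough estimate of used heap
                if (used > maxUsed) {
                  maxUsed = used;
                }
                try {
                  Thread.sleep(50);
                } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
                  return;
                }
              }
            }

            void stop() { running = false; }

            long maxUsedBytes() { return maxUsed; }
          }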

          Deneche A. Hakim added a comment -

          I did some tests on some of the datasets used in Breiman's paper to compare the results of the reference implementation.

          The test procedure described in Breiman's paper is as follows:

          • 10% of the dataset is kept apart as a testing set
          • for each dataset we build two forests, one with m=int(log2(M)+1) (called Random-Input) and one with m=1 (called Single-Input); a quick computation of m is sketched right after this list
          • we used the forest that gave the lowest oob error estimation to compute the test set error
          • we compute the test set error using the Single Input Forest, to show that even when m=1 Random Forests give comparable results to greater values of m
          • we compute the mean test set error using every tree of the forest that gave the lowest oob error. This shows how a single Decision Tree performs
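
          For concreteness, the two values of m used in this procedure can be computed as follows (M = 9 is just an example attribute count, not taken from any particular dataset):

          int M = 9;                                                     // example number of attributes
          int randomInputM = (int) (Math.log(M) / Math.log(2.0) + 1.0);  // m = int(log2(M) + 1) = 4 when M = 9
          int singleInputM = 1;                                          // the Single-Input forest always uses m = 1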

          In the following tables:

          • Selection: test set result with the forest that gave the lowest oob error
          • Single Input: test set result with the Single-Input forest
          • One Tree: mean tree test set error

          Breiman's results:

          Data          | Selection | Single Input | One Tree
          glass         | 20.6      | 21.2         | 36.9
          breast cancer | 2.9       | 2.7          | 6.3
          diabetes      | 24.2      | 24.3         | 33.1
          sonar         | 15.9      | 18.0         | 31.7
          ionosphere    | 7.1       | 7.5          | 12.7
          vehicle       | 25.8      | 26.4         | 33.1
          german        | 24.4      | 26.2         | 33.3

          Reference Implementation results:
          I also included how much system time (mean) each forest (Random-Input or Single-Input) took to build.

          Data          | Selection | Single Input | One Tree | Mean RI Time | Mean SI Time
          glass         | 24.8      | 23.9         | 41.2     | 9s 19        | 2s 667
          breast cancer | 2.8       | 2.7          | 5.8      | 2s 588       | 1s 60
          diabetes      | 24.5      | 24.6         | 32.1     | 34s 875      | 10s 284
          sonar         | 14.6      | 15.3         | 32.3     | 10s 89       | 2s 227
          ionosphere    | 7.1       | 7.0          | 15.5     | 33s 190      | 6s 96
          vehicle       | 25.3      | 26.4         | 33.7     | 42s 194      | 10s 21
          german        | 23.15     | 25.27        | 32.8     | 10s 203      | 3s 654
          Deneche A. Hakim added a comment -

          3rd Week Patch
          work in progress...

          Changes

          • ForestBuilder becomes an object that uses a TreeBuilder object
          • RandomForest represents a... guess what! It has methods to classify single instances or a bunch of data. It also contains methods to compute the total and mean number of nodes and the mean max depth of the trees
          • Added more PredictionCallback implementations!
            • MeanTreeCollector computes the mean classification error among all the trees of the forest
            • MultiCallback allows many callbacks to be passed to the same classification method (see the sketch after this list)
          • BreimanExample is a running example similar to the testing procedures used in Breiman's paper about Random Forests
          • MemoryUsage is a running app used to collect the stats about memory usage
          • DataSplit, a temporary app, allows splitting the KDD dataset (1%, 10%, 25%, 50%)
          • TreeBuilder is an abstract class that builds a Decision Tree given a Data instance
          • DefaultTreeBuilder: an implementation of TreeBuilder based on Andrew W. Moore's Decision Trees tutorial
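
          As an illustration of the MultiCallback idea mentioned above (a sketch with an assumed callback signature, not the actual patch code):

          import java.util.ArrayList;
          import java.util.List;

          // Hypothetical callback interface; the real one in the patch passes the Instance itself.
          interface PredictionCallbackSketch {
            void prediction(int treeId, int instanceId, int prediction);
          }

          // Fans a single prediction out to several callbacks, so one classification pass can feed,
          // for example, an oob-error collector and a mean-tree-error collector at the same time.
          class MultiCallbackSketch implements PredictionCallbackSketch {
            private final List<PredictionCallbackSketch> callbacks = new ArrayList<PredictionCallbackSketch>();

            void add(PredictionCallbackSketch callback) {
              callbacks.add(callback);
            }

            public void prediction(int treeId, int instanceId, int prediction) {
              for (PredictionCallbackSketch callback : callbacks) {
                callback.prediction(treeId, instanceId, prediction);
              }
            }
          }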

          What's next

          • some more memory usage tests
          • I think it's time to start with the map-reduce implementation; the results of the memory usage tests should help us decide which implementation to pursue
          Ted Dunning added a comment -

          Just for grins, I plotted the max memory usage versus number of instances. The relationship (so far) is very much a straight line. The fit that R gives me is 450 (se=25) bytes per data instance with a fixed overhead of 23MB (se=14MB). The fixed overhead can't necessarily be distinguished from zero, but it looks right.

          The 450 byte overhead per training instance seems a little bit high, but I don't know the data well so it might be pretty reasonable. The original data size was about 100 bytes.
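
          As a rough sanity check of that fit against the table in the earlier comment (approximate arithmetic only): 450 B/instance x 1.224.607 instances + 23 MB ≈ 551 MB + 23 MB ≈ 574 MB, which is close to the 569.795.152 b (~570 MB) of max used memory reported for KDD 25% with one tree.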

          Deneche A. Hakim added a comment -

          The 450 byte overhead per training instance seems a little bit high, but I don't know the data well so it might be pretty reasonable. The original data size was about 100 bytes.

          I may be able to explain this overhead:

          • First of all, the memory estimations that I've done didn't account for memory not yet garbage collected, so I've run the tests again and this time I launched the Garbage Collector just after loading the data;
          • In a separate run, I allocated a double[nb instances][nb attributes] and noted how much memory is used

          Dataset | Data size (nb instances x nb attributes) | Mem. used by double[nb instances][nb attributes] | MUALD
          KDD 1%  | 49.402 x 42                              | 19.050.312 B                                     | 22.331.360 B
          KDD 10% | 494.021 x 42                             | 178.094.200 B                                    | 204.659.576 B
          KDD 25% | 1.224.607 x 42                           | 438.395.224 B                                    | 500.341.256 B
          KDD 50% | 2.449.215 x 42                           | 873.266.456 B                                    | 998.331.560 B

          Most of the overhead is caused by how the instances are represented in memory: I'm using a DenseVector, so all the attributes are stored in a double[], which means that each attribute uses 8 B of memory. By examining the original data, we can see that most of the attributes contain at most 3 digits, and because they are stored as text they take at most 4 B if we count the separator.

          I suppose that the difference between MUALD and the memory used by double[][] is caused by the way the JVM stores the references to the instances' objects.
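
          Putting rough numbers on this (approximate arithmetic based on the table above): 42 attributes x 8 B ≈ 336 B of raw double storage per instance; the measured double[][] figures work out to roughly 19.050.312 B / 49.402 ≈ 386 B per instance (array headers and references included); and MUALD works out to roughly 22.331.360 B / 49.402 ≈ 452 B per instance for KDD 1%, which is in the same range as the ~450 B/instance slope of the fit above.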

          Deneche A. Hakim added a comment -

          Optimization patch
          Just the reference implementation; it does not contain the mapred implementation

          changes

          • the class InformationGain, responsible for finding the best split using Information Gain, has been renamed IgSplit and is now abstract; the class DefaultIgSplit contains the current implementation, and I added an optimized version, OptIgSplit (see the sketch after this list)
          • examples.CpuTest is a small program (it uses CLI by the way) to test the optimizations. This program works as follows:
            • the program loads kdd50%, just to reduce the loading time;
            • using the passed seed, the program creates a random number generator;
            • the program instantiates a TreeBuilder and passes it a DefaultIgSplit or an OptIgSplit depending on the passed parameters
            • the program repeats N times (N is a CommandLine parameter)
              • using the passed ratio (a number in the range (0,1]), the program creates a random subset from the data;
              • the program builds a single tree, selecting one random variable at each tree-node
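
          Roughly, the refactoring described in the first bullet gives a shape like the following (a sketch with assumed names for the split result; not the actual patch code):

          // Sketch of the IgSplit abstraction: DefaultIgSplit and OptIgSplit would both
          // extend this and return the best split found for one attribute.
          abstract class IgSplitSketch {

            // Result of evaluating one attribute: the information gain obtained
            // and, for numerical attributes, the chosen split value.
            static class Split {
              final int attribute;
              final double informationGain;
              final double splitValue;

              Split(int attribute, double informationGain, double splitValue) {
                this.attribute = attribute;
                this.informationGain = informationGain;
                this.splitValue = splitValue;
              }
            }

            // 'Data' stands in for the patch's Data class.
            interface Data { }

            // Compute the best split of the given data over the given attribute.
            abstract Split computeSplit(Data data, int attribute);
          }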

          Here are some results on the KDD dataset; for each ratio we launch the program twice, once with the DefaultIgSplit and once with the OptIgSplit. Each time we build 50 trees, and the initial seed is always 1L. I measured the sum of the build times for all the trees; it does not include the data loading nor the subset generation.

          Ratio | Default       | Optimized
          0.1%  | 1m 1s 336     | 2s 580
          0.5%  | 18m 34s 523   | 1m 21s 285
          1%    | 36m 37s 953   | 1m 40s 699
          5%    | 1h 43m 4s 899 | 2m 19s 745
          10%   | 6h 8m 46s 947 | 5m 8s 359

          Because the KDD dataset does not contain categorical attributes, these results are only for numerical attributes; I shall run another batch of tests using another dataset. There is still room for improvement, but it requires sorting the whole dataset for each of its numerical attributes. This could be done by processing the dataset once and storing an optimization-friendly version for later use.

          Now that the building algorithm is fast enough, I tried it again with Breiman's example and discovered (horrified) that the out-of-bag classification takes a lot of time: using KDD 10%, one tree is built in 5s with the optimized code, and the oob classification takes more than 19 minutes! I shall (must) investigate this issue as soon as possible.

          Deneche A. Hakim added a comment -

          I forgot to mention that I used Kdd50% instead of Kdd100% for CpuTest, so a Ratio of 10% in the CpuTest results means 5% of the whole Kdd dataset

          Deneche A. Hakim added a comment -

          I did some tests on the "poker hand" dataset from UCI; it contains 8 categorical attributes and 1.000.000 instances. I got the following results (50 trees):

          Ratio | Default     | Optimized
          100%  | 11m 31s 253 | 8m 32s 446

          It seems that the default implementation is fast enough for categorical attributes, and the optimized version is faster.

          I also found the issue with the oob error estimation. The old code was:

          Data bag = data.bagging(rng);
          
          Node tree = treeBuilder.build(bag);
          
          // predict the label for the out-of-bag elements
          for (int index = 0; index < data.size(); index++) {
            Instance v = data.get(index);
          
            if (!bag.contains(v)) {
              int prediction = tree.classify(v);
              callback.prediction(treeId, v, prediction);
            }
          }
          

          The problem was with bag.contains(): commenting out this test dropped the build time from 21m 8s 473 to 5s 913. I modified Data.bagging() to fill a given boolean array indicating which instances were sampled into the bag, and used it as follows:

          Arrays.fill(sampled, false);
          Data bag = data.bagging(rng, sampled);
          
          Node tree = treeBuilder.build(bag);
          
          // predict the label for the out-of-bag elements
          for (int index = 0; index < data.size(); index++) {
            Instance v = data.get(index);
          
            if (sampled[index] == false) {
              int prediction = tree.classify(v);
              callback.prediction(treeId, v, prediction);
            }
          }
          

          The new build time is 6s 777. I think this issue is solved (for now...)
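
          For completeness, here is a generic illustration of the bagging-with-flags idea (a standalone sketch, not the actual Data.bagging() code):

          import java.util.ArrayList;
          import java.util.List;
          import java.util.Random;

          // Sample with replacement and record which source elements ended up in the bag;
          // elements whose flag stays false are the out-of-bag elements.
          class BaggingSketch {
            static <T> List<T> bagging(List<T> data, Random rng, boolean[] sampled) {
              List<T> bag = new ArrayList<T>(data.size());
              for (int i = 0; i < data.size(); i++) {
                int index = rng.nextInt(data.size()); // pick a random element, with replacement
                bag.add(data.get(index));
                sampled[index] = true;                // mark it as in-bag
              }
              return bag;
            }
          }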

          Deneche A. Hakim added a comment -

          This issue will be committed as part of MAHOUT-145

          Deneche A. Hakim added a comment -

          This patch contains the reference implementations and also some code that is common to all implementations.
          If someone could review it before I commit it (my first commit =P )

          Deneche A. Hakim added a comment -
          • updated the patch with the latest changes in the trunk (common.* stuff)
          Deneche A. Hakim added a comment -

          Added a quick start guide in the wiki: http://cwiki.apache.org/confluence/display/MAHOUT/Breiman+Example

            People

            • Assignee: Deneche A. Hakim
            • Reporter: Deneche A. Hakim
            • Votes: 0
            • Watchers: 1
