Mahout / MAHOUT-145

PartialData mapreduce Random Forests

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.2
    • Fix Version/s: 0.2
    • Component/s: Classification
    • Labels: None

      Description

      This implementation is based on a suggestion by Ted:

      "modify the original algorithm to build multiple trees for different portions of the data. That loses some of the solidity of the original method, but could actually do better if the splits exposed non-stationary behavior."

      Attachments

      1. partial_August_2.patch
        325 kB
        Deneche A. Hakim
      2. partial_August_9.patch
        332 kB
        Deneche A. Hakim
      3. partial_August_10.patch
        341 kB
        Deneche A. Hakim
      4. partial_August_13.patch
        322 kB
        Deneche A. Hakim
      5. partial_August_15.patch
        325 kB
        Deneche A. Hakim
      6. partial_August_17.patch
        322 kB
        Deneche A. Hakim
      7. partial_August_19.patch
        388 kB
        Deneche A. Hakim
      8. partial_August_24.patch
        408 kB
        Deneche A. Hakim
      9. partial_August_27.patch
        446 kB
        Deneche A. Hakim
      10. partial_August_31.patch
        447 kB
        Deneche A. Hakim
      11. partial_Sep_15.patch
        448 kB
        Deneche A. Hakim
      12. partial_Sep_30.patch
        190 kB
        Deneche A. Hakim

        Issue Links

          Activity

          Deneche A. Hakim added a comment -

          A possible implementation is as follows:

           • Use a custom InputFormat similar to TextInputFormat that returns all the lines of a split at once, in a Text or, better, a custom Writable that holds a String[].
           • The mapper simply converts the input lines to a Data instance and uses the reference implementation to build a tree.

           The custom InputFormat can either be a specialized NLineInputFormat with a custom RecordReader that returns all the lines of a split at once, or inherit from FileInputFormat and use the same custom RecordReader.
           The advantage of inheriting from NLineInputFormat is that it makes it easy to configure the number of lines (instances) used to grow each tree, but it reads all the data when generating the splits, which can slow down the implementation because the split generation is done on the client machine.

          Ted Dunning added a comment -

          What do you think about using a normal mapper structure where the map() method reads one line at a time, stores the record into memory and then does the tree building in the close() method of your mapper?

          This trick is used extensively in streaming. If you are using 0.18.* then you have to stash the output collector in an instance variable so that you can produce output (or just open a task specific output file). In 0.20, I think that the Context argument is passed to the close method to avoid that need. Because production of output in the close() is so important to some applications, you are guaranteed to be able to use the output collector in close().
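           A minimal sketch of that pattern against the old (0.18/0.19) mapred API; the class name, the output types and the buildTreeFrom() helper below are purely illustrative placeholders, not the actual Mahout mapper:

             import java.io.IOException;
             import java.util.ArrayList;
             import java.util.List;

             import org.apache.hadoop.io.LongWritable;
             import org.apache.hadoop.io.NullWritable;
             import org.apache.hadoop.io.Text;
             import org.apache.hadoop.mapred.MapReduceBase;
             import org.apache.hadoop.mapred.Mapper;
             import org.apache.hadoop.mapred.OutputCollector;
             import org.apache.hadoop.mapred.Reporter;

             public class TreeBuildingMapper extends MapReduceBase
                 implements Mapper<LongWritable, Text, NullWritable, Text> {

               private final List<String> lines = new ArrayList<String>();
               // stash the collector so that close() can still emit output (needed with the 0.18/0.19 API)
               private OutputCollector<NullWritable, Text> output;

               @Override
               public void map(LongWritable key, Text value,
                   OutputCollector<NullWritable, Text> collector, Reporter reporter) throws IOException {
                 this.output = collector;     // remember the collector for close()
                 lines.add(value.toString()); // buffer the whole split in memory
               }

               @Override
               public void close() throws IOException {
                 if (output == null) {
                   return; // empty split, nothing was mapped
                 }
                 // all the split's lines are available here: convert them to a Data instance,
                 // run the sequential tree builder, then emit the (serialized) tree
                 String tree = buildTreeFrom(lines); // hypothetical stand-in for the real builder call
                 output.collect(NullWritable.get(), new Text(tree));
               }

               private static String buildTreeFrom(List<String> lines) {
                 return "tree built from " + lines.size() + " instances"; // placeholder
               }
             }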

          Deneche A. Hakim added a comment -

          What do you think about using a normal mapper structure where the map() method reads one line at a time, stores the record into memory and then does the tree building in the close() method of your mapper?

           Excellent idea! And no need to create another custom InputFormat =D

          Deneche A. Hakim added a comment -

           In the partial implementation, the input of the program is the data and T (the number of trees). The data is split up between the mappers, but how many trees should each mapper build?

          I've got two ideas:

           • [easiest] each mapper builds T trees on its subset of the data; this makes it easy to configure how many trees each mapper builds, but it's somewhat tricky to estimate the total number of trees because it will depend on FileInputFormat.getSplits() (min split size, block size, data size...)
           • each mapper builds T/M trees, where M is the number of mappers available. The user sets the total number of trees, and the number of trees that each mapper builds will depend on the number of splits (a rough sketch of this distribution follows the list)
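           A rough sketch of how the second idea could distribute the trees; the method and parameter names are illustrative only:

             // Distribute totalTrees over numMaps map tasks; the first (totalTrees % numMaps)
             // partitions get one extra tree so that the counts sum back to totalTrees.
             public static int treesForPartition(int totalTrees, int numMaps, int partition) {
               int base = totalTrees / numMaps;
               int remainder = totalTrees % numMaps;
               return partition < remainder ? base + 1 : base;
             }

           For example, T = 100 trees over M = 8 maps gives 13 trees to each of the first 4 partitions and 12 to each of the remaining 4.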

           Any suggestions?

          Deneche A. Hakim added a comment -

           To be able to predict the class of an out-of-bag instance, one must classify it using all the trees of the forest, and because each mapper has access to only a subset of the trees, a second job is needed. Unless of course I'm missing something.

           I've already implemented the first job; now I should start on the second.

          Deneche A. Hakim added a comment -

          partial-mapred implementation

          changes

          • abstract class org.mahout.rf.mapred.Builder : Base class for Mapred Random Forest builders. Takes care of storing the parameters common to the mapred implementations: tree builder, data path, dataset path and seed. The child classes must implement at least :
            • void configureJob(JobConf) : to further configure the job before its launch; and
            • RandomForest parseOutput(JobConf, PredictionCallback) in order to convert the job outputs into a RandomForest and its corresponding oob predictions
          • abstract class org.mahout.rf.mapred.MapredMapper : Base class for Mapred mappers. Loads common parameters from the job
           • org.mahout.rf.mapred.examples.BuildForest : can now build a forest using either the in-mem or partial implementations (mapred or sequential).
             It also has a special mode (-c command-line option) that checks whether the results of the mapred and sequential implementations are the same; I use it to test the implementations,
             because when using JUnit, Hadoop uses a local runner with just one mapper
           • one important change concerns the Dataset class. This class describes the data attributes. I added a tool (org.apache.mahout.rf.tools.Describe) that takes a data path and a weird description string, then generates a Dataset and stores it in a file. This file is then passed to the various builders, allowing them to convert the data instances on the fly. For example, the KDD description is "N 3 C 2 N C 4 N C 8 N 2 C 19 N L" (I told you, it's weird!!!), which means that (a small sketch expanding this format follows the list):
             • the first attribute is Numerical
             • the next 3 attributes are Categorical
             • the next 2 attributes are Numerical
             • ...
             • the last attribute is the Label
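           A small, self-contained sketch that expands such a description string into one type per attribute; the class name and type labels are illustrative, this is not the actual Describe/Dataset code:

             import java.util.ArrayList;
             import java.util.List;

             public class DescriptorDemo {

               // Expands a description like "N 3 C 2 N ..." into a list of attribute types.
               // A number token sets how many times the following type token is repeated.
               public static List<String> expand(String descriptor) {
                 List<String> attrs = new ArrayList<String>();
                 int repeat = 1;
                 for (String token : descriptor.trim().split("\\s+")) {
                   if (token.matches("\\d+")) {
                     repeat = Integer.parseInt(token);
                   } else {
                     String type = "N".equals(token) ? "NUMERICAL"
                                 : "C".equals(token) ? "CATEGORICAL"
                                 : "LABEL"; // "L"
                     for (int i = 0; i < repeat; i++) {
                       attrs.add(type);
                     }
                     repeat = 1;
                   }
                 }
                 return attrs;
               }

               public static void main(String[] args) {
                 // the KDD description above expands to 42 entries: 41 attributes plus the label
                 System.out.println(expand("N 3 C 2 N C 4 N C 8 N 2 C 19 N L"));
               }
             }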

          package org.apache.mahout.rf.mapred.partial

          • InterResults : Utility class that stores/loads the intermediate results passed from the 1st to the 2nd step of the partial implementation
           • PartialBuilder : inherits from Builder and builds the forest by splitting the data among the mappers. Runs in two steps:
             • in the first step each mapper receives a subset of the data as its input split, builds a given number of trees, and returns each tree with the classifications of the instances of the mapper's split that are oob;
             • in the second step each mapper receives the trees generated by the first step and computes, for each tree that does not belong to the mapper's partition, the classifications of all the instances of the mapper's split.
               PartialBuilder goes through the final step's results and passes the classifications to a given PredictionCallback, allowing the calling code to compute the oob error estimate.
           • Step1Mapper : First step mapper. Builds the trees using the data available in the InputSplit. Predicts the oob classes for each tree in its growing partition (input split).
           • PartialSequentialBuilder : Simulates the Partial mapreduce implementation in a sequential manner, useful when testing the implementation's performance
           • Step2Job : 2nd step of the partial mapreduce builder. Computes the oob predictions using all the trees of the forest
           • Step2Mapper : Second step mapper. Using the trees of the first step, computes the oob predictions for each tree, except those of its own partition, on all instances of the partition.
           • TreeID: inherits from LongWritable; allows combining a partition integer and a treeId integer into a single LongWritable. Used by the first and second steps to uniquely identify each tree of the forest and the partition it belongs to (a rough sketch of this packing follows).
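           A rough sketch of that packing; the constant below is an assumed bound chosen for illustration, not necessarily the layout the actual TreeID uses:

             import org.apache.hadoop.io.LongWritable;

             // Combines a partition id and a tree id into one long: value = partition * MAX_TREES + treeId.
             public class TreeIdSketch extends LongWritable {
               private static final long MAX_TREES = 100000L; // assumed upper bound on trees per partition

               public void set(int partition, int treeId) {
                 set(partition * MAX_TREES + treeId);
               }

               public int partition() {
                 return (int) (get() / MAX_TREES);
               }

               public int treeId() {
                 return (int) (get() % MAX_TREES);
               }
             }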
          Deneche A. Hakim added a comment -

           I'm running some tests to compare the in-mem and partial implementations. Here are the first results from my laptop (Hadoop 0.19.1 in pseudo-distributed mode, 2-core processor):

          All the tests are using a random seed = 1 and only one random feature is selected at a time.

          KDD 1%

           Num Map Tasks | Num trees | In-Mem build time | Partial build time | In-Mem oob error | Partial oob error
           2  | 10  | 0h 0m 21s 5   | 0h 0m 31s 823 | 8.38E-4 | 0.43
           2  | 100 | 0h 0m 57s 641 | 0h 0m 44s 43  | 4.45E-4 | 0.42
           2  | 200 | 0h 1m 38s 307 | 0h 1m 4s 523  | 4.45E-4 | 0.43
           2  | 400 | 0h 3m 5s 883  | 0h 1m 43s 852 | 4.65E-4 | 0.42
           5  | 10  | 0h 0m 28s 404 | 0h 0m 33s 374 | 8.38E-4 | 0.32
           5  | 100 | 0h 1m 12s 260 | 0h 0m 43s 628 | 4.65E-4 | 0.34
           5  | 200 | 0h 2m 0s 293  | 0h 0m 47s 994 | 4.45E-4 | 0.34
           5  | 400 | 0h 3m 28s 69  | 0h 1m 4s 351  | 4.65E-4 | 0.34
           10 | 10  | 0h 0m 42s 654 | 0h 0m 49s 785 | 7.98E-4 | 0.23
           10 | 100 | 0h 1m 19s 405 | 0h 0m 53s 646 | 4.45E-4 | 0.23
           10 | 200 | 0h 2m 6s 375  | 0h 0m 56s 89  | 4.65E-4 | 0.23
           10 | 400 | 0h 3m 33s 253 | 0h 1m 8s 29   | 4.45E-4 | 0.23
           20 | 10  |               |               |         |
           20 | 100 | 0h 2m 21s 762 | 0h 1m 23s 883 | 4.04E-4 | 0.23
           20 | 200 | 0h 2m 32s 952 | 0h 1m 22s 12  | 4.45E-4 | 0.23
           20 | 400 | 0h 4m 4s 487  | 0h 1m 31s 248 | 4.25E-4 | 0.23
           50 | 10  |               |               |         |
           50 | 100 | 0h 3m 15s 485 | 0h 2m 53s 70  | 4.25E-4 | 0.23
           50 | 200 | 0h 4m 2s 509  | 0h 2m 51s 733 | 4.45E-4 | 0.23
           50 | 400 | 0h 5m 27s 252 | 0h 3m 7s 542  | 4.25E-4 | 0.23
          Deneche A. Hakim added a comment -

          more tests on my laptop:

          KDD 10%

           Num Map Tasks | Num trees | In-Mem build time | Partial build time | In-Mem oob error | Partial oob error
           2  | 10  | 0h 2m 44s 635  | 0h 1m 37s 249  | 3.11E-4 | 0.63
           2  | 100 | 0h 11m 57s 389 | 0h 5m 52s 22   | 2.63E-4 | 0.63
           2  | 200 | 0h 24m 17s 81  | 0h 10m 46s 735 | 2.65E-4 | 0.63
           2  | 400 | 0h 47m 24s 519 | 0h 21m 28s 939 | 2.57E-4 | 0.63
           5  | 10  | 0h 2m 19s 742  | 0h 0m 59s 211  | 4.92E-4 | 0.58
           5  | 100 | 0h 14m 10s 964 | 0h 2m 32s 969  | 2.42E-4 | 0.58
           5  | 200 | 0h 27m 12s 29  | 0h 4m 18s 984  | 2.59E-4 | 0.58
           5  | 400 | 0h 52m 29s 179 | 0h 8m 9s 980   | 2.42E-4 | 0.58
           10 | 10  | 0h 3m 8s 587   | 0h 1m 12s 826  | 5.41E-4 | 0.50
           10 | 100 | 0h 13m 42s 344 | 0h 2m 10s 523  | 2.63E-4 | 0.54
           10 | 200 | 0h 24m 22s 871 | 0h 3m 0s 816   | 2.57E-4 | 0.51
           10 | 400 | 0h 49m 39s 381 | 0h 4m 56s 698  | 2.53E-4 | 0.51
           20 | 10  |                |                |         |
           20 | 100 | 0h 15m 20s 24  | 0h 2m 34s 573  | 2.42E-4 | 0.45
           20 | 200 | 0h 29m 43s 385 | 0h 3m 7s 545   | 2.55E-4 | 0.45
           20 | 400 | 0h 50m 43s 957 | 0h 4m 12s 662  | 2.55E-4 | 0.45
           50 | 10  |                |                |         |
           50 | 100 | 0h 20m 35s 45  | 0h 3m 52s 244  | 2.46E-4 | 0.43
           50 | 200 | 0h 32m 26s 342 | 0h 4m 24s 853  | 2.48E-4 | 0.43
           50 | 400 | 0h 55m 28s 281 | 0h 5m 5s 999   | 2.51E-4 | 0.43
          Ted Dunning added a comment -

          Ouch!

           Num Map Tasks | Num trees | In-Mem build time | Partial build time | In-Mem oob error | Partial oob error
           ...
           2  | 100 | 0h 0m 57s 641 | 0h 0m 44s 43 | 4.45E-4 | 0.42
           ...
           10 | 400 | 0h 3m 33s 253 | 0h 1m 8s 29  | 4.45E-4 | 0.23

          This looks like it runs faster (or at least not much slower), but produces astronomically worse results.

          What really bugs me is that it is worse with few maps. Am I interpreting this correctly when I say that splitting the data in half and building independent forests increases OOB errors by a factor of 1000? How could that possibly be?

          Deneche A. Hakim added a comment -

          What really bugs me is that it is worse with few maps. Am I interpreting this correctly when I say that splitting the data in half and building independent forests increases OOB errors by a factor of 1000? How could that possibly be?

          Only one possible explanation: a BUG. I already have an idea where I can find it...

          Deneche A. Hakim added a comment -

           As expected, I found a bug and removed it. I then launched another batch of tests on my laptop:

           Num Map Tasks | Num Trees | Partial oob error
           2  | 100 | 0.043
           2  | 400 | 0.033
           10 | 100 | 0.051
           10 | 400 | 0.051
           50 | 100 | 0.43
           50 | 400 | 0.43

           As I said in a previous comment, Partial Builder uses two steps to complete its job:

           • In the first step, each mapper builds a number of trees using the subset of data available in its partition. If there are P partitions then, because of the bagging, each tree is built using about 2/(3 x P) of the data.
           • Because all the instances that don't belong to a tree's partition can be considered oob, a second step is used to complete the oob computation. Thus each tree is tested against 1 - 2/(3 x P) of the data (a worked example follows the list).
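           For illustration, assuming the usual ~2/3 bootstrap coverage mentioned above and P = 10 partitions, the two fractions (data used to grow each tree, and data used to test it) work out to:

             P = 10:\quad \frac{2}{3P} = \frac{2}{30} \approx 6.7\%, \qquad 1 - \frac{2}{3P} = \frac{28}{30} \approx 93.3\%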

          Using only the first step, I got the following results:

           Num Map Tasks | Num Trees | Partial oob error
           2  | 100 | 2.85E-4
           2  | 400 | 2.67E-4
           10 | 100 | 4.88E-4
           10 | 400 | 2.81E-4
           50 | 100 | 7.19E-4
           50 | 400 | 5.46E-4

           Although the second step passes the unit tests, there is a possibility of a bug hiding somewhere. I'm going to use the reference implementation, run it on subsets of the data, and use the forests to classify the whole data in the same way Partial Builder does; this should confirm whether there is a bug or not.

          Deneche A. Hakim added a comment -

           OK, here is what I did:

          • Load KDD 10%
          • partition the data among P partitions
           • for each partition (p), run the ref. implementation builder; we get a forest Fp and a set of predictions Cp
          • for each partition (p)
            • for each forest Fk where k <> p, classify the instances of partition p and update Cp
          • compute the oob

           I launched the test with num trees = 100 and num maps = 2, 10, 50, and got almost exactly the same results as Partial Builder... Conclusion: there is no visible bug in the second step of Partial Builder.

          Ted Dunning added a comment -

          So at this point, it seems that you

          • have demonstrated that partitioning works to produce a usable forest because your errors on the partitioned forest seem similar
          • have demonstrated substantial speedup for large numbers of trees

          Is this correct?

          Deneche A. Hakim added a comment -

          have demonstrated that partitioning works to produce a usable forest because your errors on the partitioned forest seem similar

           Yep, the last test shows that the results of the partial implementation are correct... or that the reference implementation is wrong, but I'm not considering this possibility (just kidding, my first tests on the ref. impl. gave similar results to Breiman's paper)

          have demonstrated substantial speedup for large numbers of trees

           Oh yeah, it's fast: the partial implementation running on my laptop is two times faster than the in-mem implementation running on a 10-node cluster!!!
           But its oob error is not so good. I should use a larger dataset (why not KDD 100%?) with more trees and see what happens.

           Actually, there is a performance issue that I hit using KDD 25% (hum... bigger datasets seem to bring bigger problems). It should take a day or two to resolve.

          Deneche A. Hakim added a comment -
          • resolved a bug in Partial Implementation

          This patch includes MAHOUT-140 and MAHOUT-122 .

          Deneche A. Hakim added a comment -

          changes

           The Partial Implementation has been improved to work better with larger datasets; I'm now able to deal with KDD 50% on EC2.

          Deneche A. Hakim added a comment -

           Here are some results from a 10-node cluster (c1.medium):

           Dataset | Num Map Tasks | Num Trees | Build Time | oob error
           KDD 10% | 10 | 400 | 0h 1m 46s 19  | 0.051
           KDD 10% | 20 | 400 | 0h 1m 15s 571 | 0.090
           KDD 10% | 50 | 400 | 0h 1m 46s 19  | 0.051
           KDD 25% | 10 | 100 | 0h 1m 18s 574 | 0.43
           KDD 25% | 10 | 400 | 0h 4m 9s 999  | 0.019
           KDD 25% | 20 | 400 | 0h 2m 42s 293 | 0.50

           Having some heap size issues, I set HADOOP_HEAPSIZE=2000 for the next tests:

           Dataset | Num Map Tasks | Num Trees | Build Time | oob error
           KDD 50% | 10 | 100 | 0h 1m 52s 338 | 0.19
           KDD 50% | 20 | 400 | 0h 5m 54s 961 | 0.18
           KDD 50% | 50 | 400 | 0h 4m 18s 861 | 0.47

           For now I'm not able to process KDD 100% because of a limitation in my code. The Partial Builder takes 6 minutes to build 100 trees with 10 maps, but the example program hangs when comparing the forest predictions with the data labels, because the current example code loads the whole dataset in memory before checking the labels =P

          Ted Dunning added a comment -

          These are confusing numbers. First, why does the number of trees vary like this?

          Secondly, the oob error jumps around a lot in confusing ways.

          Thirdly, the times don't seem to match what I would expect. Moreover, KDD10 at 10 and 50 map tasks take exactly the same amount of time.

          My expectation would have been that running 20 map tasks would do almost twice as well as running 10 because we have 10 machines each of which is dual core. Running 50 map tasks should be about the same as 20. We see that pattern on KDD25 except we don't have a datapoint for 50 maps.

          Deneche A. Hakim added a comment -

          These are confusing numbers. First, why does the number of trees vary like this?

          Hmm...well...my primary focus was to check if the implementation was able to handle larger datasets. I shall run another, more coherent, batch of tests soon

          Secondly, the oob error jumps around a lot in confusing ways.

          Thirdly, the times don't seem to match what I would expect. Moreover, KDD10 at 10 and 50 map tasks take exactly the same amount of time.

           Ouch, it's a copy-and-paste brain bug!!! OK, I'll be more careful with the next test

          My expectation would have been that running 20 map tasks would do almost twice as well as running 10 because we have 10 machines each of which is dual core. Running 50 map tasks should be about the same as 20. We see that pattern on KDD25 except we don't have a datapoint for 50 maps.

           Re-ouch, I used the same configuration that I used with the in-mem implementation: mapred.tasktracker.map.tasks.maximum=1, i.e. only one mapper at a time on each node

          Deneche A. Hakim added a comment -

          How the Partial Mapred builder works:

          • step 0 (centralized): the main program prepares and launches the builder
           • step 1 (mapred job): each mapper builds a set of trees and classifies the oob instances of its partition, returning each tree with the classifications of all partition instances (non-classified instances get -1)
          • step 1-2 (centralized): the builder processes the outputs of the job two times:
            • the first time in order to compute the partitions' sizes and their respective order
            • the second time to extract the trees and pass the oob classifications to a callback
              this step has been split to avoid loading all the outputs in memory (slows down the program when the data is large)
          • step 2 (mapred job): each mapper uses all the trees of the other partitions to compute the classifications for all the instances of its partition. This completes the oob error computation
          • step 2-2 (centralized): the builder processes the outputs and passes the oob classifications to a callback
           • step 3 (centralized): the main program receives the decision forest, and its callback has received all the oob classifications. In order to compute the oob error it must compare the oob classifications with the real data labels. Actually it's done by loading the whole data in memory (ouch!), extracting its labels, then computing the oob error

           In the test results, the build time is the time taken by steps 1, 1-2, 2 and 2-2. Although step 3 is not counted, it slows the tests down so much that I was not able to try KDD 100%.

          In the following results, the build time is computed by the program, and I was able to figure out the other times using the log of the program.

          EC2 10 nodes (c1.medium) cluster
          mapred.tasktracker.map.tasks.maximum=2
          mapred.child.java.opts=-Xms500m -Xmx1000m
          export HADOOP_HEAPSIZE=2000

          seed 1, m 1, oob

          KDD 10%

          Num Map Tasks Num Trees Oob Error Build Time Step 1 Step 1-2 Step 2 Step 2-2 Step 3
          10 100 0.0515 0h 0m 48s 823 24s 2s 15s 7s 14s
          10 200 0.0514 0h 0m 59s 34 27s 3s 15s 14s 13s
          10 400 0.0513 0h 1m 40s 265 43s 7s 22s 28s 13s
          20 100 0.0864 0h 0m 37s 366 15s 1s 14s 7s 14s
          20 200 0.1024 0h 0m 47s 213 14s 2s 17s 14s 13s
          20 400 0.0903 0h 1m 14s 368 18s 4s 22s 30s 13s
          50 100 0.4315 0h 0m 37s 657 13s 1s 16s 8s 14s
          50 200 0.4316 0h 0m 48s 611 15s 2s 16s 15s 14s
          50 400 0.4316 0h 1m 6s 160 14s 2s 21s 30s 12s

          As soon as I compile the results of KDD50 and KDD100 I'll post them, then I can start explaining those results (at least I will try)

           Deneche A. Hakim added a comment - edited

           Update: I did a re-run of the 50-map tests; the new results are more coherent

          KDD 25%

           Num Map Tasks | Num Trees | Oob Error | Build Time | Step 1 | Step 1-2 | Step 2 | Step 2-2 | Step 3
           10 | 100 | 0.0194 | 0h 1m 23s 210 | 39s    | 4s  | 20s | 20s    | 33s
           10 | 200 | 0.0203 | 0h 2m 16s 510 | 1m 1s  | 9s  | 26s | 41s    | 33s
           10 | 400 | 0.0195 | 0h 4m 10s 9   | 1m 53s | 18s | 39s | 1m 20s | 32s
           20 | 100 | 0.3875 | 0h 1m 5s 288  | 20s    | 2s  | 18s | 25s    | 31s
           20 | 200 | 0.3626 | 0h 1m 29s 145 | 23s    | 5s  | 22s | 39s    | 33s
           20 | 400 | 0.5003 | 0h 2m 30s 789 | 35s    | 8s  | 28s | 1m 19s | 32s
           50 | 100 | 0.5041 | 0h 1m 1s 375  | 19s    | 3s  | 19s | 21s    | 32s
           50 | 200 | 0.5041 | 0h 1m 19s 202 | 19s    | 2s  | 22s | 36s    | 32s
           50 | 400 | 0.5041 | 0h 2m 2s 250  | 18s    | 4s  | 28s | 1m 12s | 33s
          Deneche A. Hakim added a comment -

          KDD 50%

           Num Map Tasks | Num Trees | Oob Error | Build Time | Step 1 | Step 1-2 | Step 2 | Step 2-2 | Step 3
           10 | 100 | 0.1911 | 0h 2m 39s 73  | 1m 23s | 9s  | 27s   | 40s    | 1m 7s
           10 | 200 | 0.1902 | 0h 4m 57s 268 | 2m 39s | 17s | 40s   | 1m 21s | 1m 4s
           10 | 400 | 0.1880 | 0h 9m 1s 400  | 4m 37s | 34s | 1m 5s | 2m 46s | 1m 6s
           20 | 100 | 0.1905 | 0h 1m 44s 853 | 32s    | 5s  | 24s   | 44s    | 1m 5s
           20 | 200 | 0.1853 | 0h 2m 58s 462 | 48s    | 9s  | 30s   | 1m 32s | 1m 3s
           20 | 400 | 0.1856 | 0h 5m 20s 231 | 1m 26s | 17s | 47s   | 2m 50s | 1m 5s
           50 | 100 | 0.4738 | 0h 1m 23s 989 | 19s    | 2s  | 24s   | 39s    | 1m 3s
           50 | 200 | 0.4738 | 0h 2m 10s 921 | 21s    | 4s  | 30s   | 1m 16s | 1m 3s
           50 | 400 | 0.4738 | 0h 3m 52s 98  | 25s    | 7s  | 44s   | 2m 36s | 1m 2s
          Deneche A. Hakim added a comment -

          Preparing the code for GSoC deadline

          • DONE: move rf.RFUtils.storeWritable() to rf.ref.tools.Describe, becomes private
          • DONE: rename rf.mapred.partial.InterResults.loadForest/storeForest to load/store
          • DONE: delete rf.mapred.partial.Step0Job and the corresponding tests
          • DONE: delete rf.ref.examples.DataSplit
          • DONE: DefaultTreeBuilder uses OptIgSplit by default
            • DONE: remove unnecessary calls to DefaultTreeBuilder.setIgSplit()
          Deneche A. Hakim added a comment -

          Preparing for GSoC deadline

          • DONE: move rf.mapred.xxx.xxxSequentialBuilder to core/tests/
            • DONE: move rf.mapred.examples.BuildForest to core/tests/, becomes df.mapred.tools.BuildForest
          • DONE: move a copy of rf.mapred.examples.BuildForest to examples/, no more calls xxxSequentialBuilder
           • DONE: rf.ref.examples.BreimanExample uses CLI
          Deneche A. Hakim added a comment -

          GSoC latest patch

          • DONE: move rf.ref.examples.BreimanExample to examples/
          • DONE: move rf.ref.examples.CpuTest to core/tests (tools package)
          • DONE: move rf.ref.examples.MemoryUsage to core/tests (tools package)
          • DONE: move rf.ref.examples.PartialStep2Test to core/tests (tools package), becomes PartialStep2Check
          • DONE: move content of rf.ref.examples.UciDescriptors to ExampleUtils
          • DONE: org.apache.mahout.rf becomes org.apache.mahout.df (Decision Forest)
          • DONE: Check that all files contain Apache License
          • DONE: add a link to Andrew's tutorial in DefaultTreeBuilder

          This should be the last patch concerning GSoC. The next ones will target the 0.2 release

          Deneche A. Hakim added a comment -

          Preparation for mahout 0.2

          • moving to Hadoop 0.20.0 API:
            • org.apache.mahout.df.mapred.* contains the code compatible with Hadoop 0.19.1
            • org.apache.mahout.df.mapreduce.* will contain the code that uses Hadoop 0.20.0 API
            • the in-mem implementation has been converted to 0.20.0 and is working
             • the partial implementation still needs a looot of work, but should be better (or, more likely, will have better bugs)
          Deneche A. Hakim added a comment -
          • DONE: partial implementation that uses Hadoop 0.20.0
          • TODO: convert the partial implementation tests to Hadoop 0.20.0
          Deneche A. Hakim added a comment -
          • DONE: convert the partial implementation tests to Hadoop 0.20.0
          • TODO: test the code on a Hadoop 0.20.0 cluster (EC2)
          Deneche A. Hakim added a comment -
          • Corrected some bugs in the new code when testing in a pseudo-distributed cluster
           Deneche A. Hakim added a comment - edited

          * TODO: test the code on a Hadoop 0.20.0 cluster (EC2)

           Looks like I'll have to wait till Hadoop 0.20.1 to be able to test on EC2... After creating my own AMI (with a lot of pain, being a noob), I stumbled upon the following bug: HADOOP-5921

          Deneche A. Hakim added a comment -

          What about using the Yahoo 0.20 distribution? (http://developer.yahoo.com/hadoop/distribution/ )

           The Yahoo distribution did the job!

           I launched the tests on a 10-node cluster with KDD10, and apart from a difference in execution time and the fact that the 0.20.0 implementation uses one more step, the results are the same

           For now I'm not able to process KDD 100% because of a limitation in my code. The Partial Builder takes 6 minutes to build 100 trees with 10 maps, but the example program hangs when comparing the forest predictions with the data labels, because the current example code loads the whole dataset in memory before checking the labels =P

          • TODO: no need to load the whole dataset in memory just to extract the labels, this should help when dealing with large datasets
          Deneche A. Hakim added a comment -
          • Will be committed as part of MAHOUT-145
           • For now two implementations are available, one that uses the Hadoop 0.20.0 API and one that doesn't. Later only one implementation should remain

          Important

          • one important part that is still missing is the integration of Decision Forests with Mahout's Classifiers. It should take some time, so the current code could be committed as it is (working but not yet integrated) and the integration will probably be available for Mahout 0.3.
          Deneche A. Hakim added a comment -
          • DONE: no need to load the whole dataset in memory just to extract the labels, this should help when dealing with large datasets
          Deneche A. Hakim added a comment -
          • This patch also includes MAHOUT-140 and MAHOUT-122.
          • in-mem and partial implementations are available for Hadoop 0.19.1 (org.apache.mahout.df.mapred.*) and Hadoop 0.20.0 (org.apache.mahout.df.mapreduce)
          • this code is not yet integrated with mahout's classifiers. I shall start on it, but not in time for mahout 0.2.0
          Deneche A. Hakim added a comment -

          committed patch

          Sara Del Río García added a comment -

          Hello Deneche A. Hakim:

           I'm testing the Random Forest partial implementation on Hadoop 2.0.0-cdh4.1.1.

           I'm trying to modify the algorithm; all I do is add more information to the leaves of the tree. Currently they contain only the label, and I want to add one more field:

           @Override
           public void readFields(DataInput in) throws IOException {
             label = in.readDouble();
             leafWeight = in.readDouble();
           }

           @Override
           protected void writeNode(DataOutput out) throws IOException {
             out.writeDouble(label);
             out.writeDouble(leafWeight);
           }

          And I get the following error:

          13/02/27 06:53:27 INFO mapreduce.BuildForest: Partial Mapred implementation
          13/02/27 06:53:27 INFO mapreduce.BuildForest: Building the forest...
          13/02/27 06:53:27 INFO mapreduce.BuildForest: Weights Estimation: IR
          13/02/27 06:53:37 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
          13/02/27 06:53:39 INFO input.FileInputFormat: Total input paths to process : 1
          13/02/27 06:53:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
          13/02/27 06:53:39 WARN snappy.LoadSnappy: Snappy native library not loaded
          13/02/27 06:53:39 INFO mapred.JobClient: Running job: job_201302270205_0013
          13/02/27 06:53:40 INFO mapred.JobClient: map 0% reduce 0%
          13/02/27 06:54:18 INFO mapred.JobClient: map 20% reduce 0%
          13/02/27 06:54:42 INFO mapred.JobClient: map 40% reduce 0%
          13/02/27 06:55:03 INFO mapred.JobClient: map 60% reduce 0%
          13/02/27 06:55:26 INFO mapred.JobClient: map 70% reduce 0%
          13/02/27 06:55:27 INFO mapred.JobClient: map 80% reduce 0%
          13/02/27 06:55:49 INFO mapred.JobClient: map 100% reduce 0%
          13/02/27 06:56:04 INFO mapred.JobClient: Job complete: job_201302270205_0013
          13/02/27 06:56:04 INFO mapred.JobClient: Counters: 24
          13/02/27 06:56:04 INFO mapred.JobClient: File System Counters
          13/02/27 06:56:04 INFO mapred.JobClient: FILE: Number of bytes read=0
          13/02/27 06:56:04 INFO mapred.JobClient: FILE: Number of bytes written=1828230
          13/02/27 06:56:04 INFO mapred.JobClient: FILE: Number of read operations=0
          13/02/27 06:56:04 INFO mapred.JobClient: FILE: Number of large read operations=0
          13/02/27 06:56:04 INFO mapred.JobClient: FILE: Number of write operations=0
          13/02/27 06:56:04 INFO mapred.JobClient: HDFS: Number of bytes read=1381649
          13/02/27 06:56:04 INFO mapred.JobClient: HDFS: Number of bytes written=1680
          13/02/27 06:56:04 INFO mapred.JobClient: HDFS: Number of read operations=30
          13/02/27 06:56:04 INFO mapred.JobClient: HDFS: Number of large read operations=0
          13/02/27 06:56:04 INFO mapred.JobClient: HDFS: Number of write operations=10
          13/02/27 06:56:04 INFO mapred.JobClient: Job Counters
          13/02/27 06:56:04 INFO mapred.JobClient: Launched map tasks=10
          13/02/27 06:56:04 INFO mapred.JobClient: Data-local map tasks=10
          13/02/27 06:56:04 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=254707
          13/02/27 06:56:04 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
          13/02/27 06:56:04 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
          13/02/27 06:56:04 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
          13/02/27 06:56:04 INFO mapred.JobClient: Map-Reduce Framework
          13/02/27 06:56:04 INFO mapred.JobClient: Map input records=20
          13/02/27 06:56:04 INFO mapred.JobClient: Map output records=10
          13/02/27 06:56:04 INFO mapred.JobClient: Input split bytes=1540
          13/02/27 06:56:04 INFO mapred.JobClient: Spilled Records=0
          13/02/27 06:56:04 INFO mapred.JobClient: CPU time spent (ms)=12070
          13/02/27 06:56:04 INFO mapred.JobClient: Physical memory (bytes) snapshot=949579776
          13/02/27 06:56:04 INFO mapred.JobClient: Virtual memory (bytes) snapshot=8412340224
          13/02/27 06:56:04 INFO mapred.JobClient: Total committed heap usage (bytes)=478412800
          READ
          nodetype: 0
          Exception in thread "main" java.lang.IllegalStateException: java.io.EOFException
          at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:104)
          at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
          at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
          at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
          at org.apache.mahout.classifier.df.mapreduce.partial.PartialBuilder.processOutput(PartialBuilder.java:129)
          at org.apache.mahout.classifier.df.mapreduce.partial.PartialBuilder.parseOutput(PartialBuilder.java:96)
          at org.apache.mahout.classifier.df.mapreduce.Builder.build(Builder.java:312)
          at org.apache.mahout.classifier.df.mapreduce.BuildForest.buildForest(BuildForest.java:246)
          at org.apache.mahout.classifier.df.mapreduce.BuildForest.run(BuildForest.java:200)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
          at org.apache.mahout.classifier.df.mapreduce.BuildForest.main(BuildForest.java:270)
          Caused by: java.io.EOFException
          at java.io.DataInputStream.readFully(DataInputStream.java:180)
          at java.io.DataInputStream.readLong(DataInputStream.java:399)
          at java.io.DataInputStream.readDouble(DataInputStream.java:451)
          at org.apache.mahout.classifier.df.node.Leaf.readFields(Leaf.java:136)
          at org.apache.mahout.classifier.df.node.Node.read(Node.java:85)
          at org.apache.mahout.classifier.df.mapreduce.MapredOutput.readFields(MapredOutput.java:64)
          at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2114)
          at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2242)
          at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
          ... 10 more

          What's the problem?
          Could you try writing something more into the leaves of the tree, to see whether it works? Anything at all.

          Thank you very much.

          Best regards,

          Sara
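
          An EOFException at this point is what one would expect whenever readFields() tries to consume more bytes than the matching write path produced for the same node, for example if the forest being parsed on the client was serialized by code that still writes only the label while the reading side now expects the label followed by leafWeight. The following self-contained sketch (plain java.io; the class LeafSerializationSketch and its methods are illustrative stand-ins, not Mahout's actual Leaf/Node classes) reproduces that failure mode:

          import java.io.ByteArrayInputStream;
          import java.io.ByteArrayOutputStream;
          import java.io.DataInput;
          import java.io.DataInputStream;
          import java.io.DataOutput;
          import java.io.DataOutputStream;
          import java.io.EOFException;
          import java.io.IOException;

          // Illustrative stand-in, not Mahout's Leaf/Node classes: shows why the bytes
          // written for a node and the bytes read back must stay symmetric.
          public final class LeafSerializationSketch {

            // Old writer: serializes only the label (one double).
            static void writeOldLeaf(DataOutput out, double label) throws IOException {
              out.writeDouble(label);
            }

            // New reader: expects the label followed by leafWeight (two doubles).
            static void readNewLeaf(DataInput in) throws IOException {
              double label = in.readDouble();
              double leafWeight = in.readDouble(); // fails with EOFException on old-format data
              System.out.println("label=" + label + " leafWeight=" + leafWeight);
            }

            public static void main(String[] args) throws IOException {
              ByteArrayOutputStream buffer = new ByteArrayOutputStream();
              writeOldLeaf(new DataOutputStream(buffer), 1.0);

              DataInput in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
              try {
                readNewLeaf(in);
              } catch (EOFException e) {
                // Same symptom as in the stack trace above: the reader ran past the
                // end of what the writer actually produced.
                System.out.println("EOFException: reader expects more bytes than were written");
              }
            }
          }

          Under that assumption, regenerating the forest with the modified writer, and making sure every code path that serializes a Leaf writes the extra double, would keep the two sides symmetric.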


            People

            • Assignee: Deneche A. Hakim
            • Reporter: Deneche A. Hakim
            • Votes: 0
            • Watchers: 1
