Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ML
    • Labels:
      None

      Description

      This is a JIRA for exploring accuracy improvements for Random Forests.

      Background

      Initial exploration was based on reports of poor accuracy from http://datascience.la/benchmarking-random-forest-implementations/

      Essentially, Spark 1.2 showed poor accuracy relative to other libraries for training set sizes of 1M and 10M.

      Initial improvements

      The biggest issue was that the metric being used was AUC, and Spark 1.2 was using hard predictions rather than class probabilities. This was fixed in SPARK-9528, which brought Spark up to accuracy parity with scikit-learn, Vowpal Wabbit, and R for the 1M training set size.
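To illustrate why hard predictions hurt AUC, here is a small self-contained sketch (plain Python, not Spark code; the labels and probabilities are made up for illustration). Probability scores rank examples finely, while thresholded 0/1 predictions collapse the ranking into two tied groups, which lowers AUC:

```python
# Minimal AUC via pairwise ranking: the fraction of (positive, negative)
# pairs where the positive is scored higher, counting ties as half.
# Illustrative only; this is not Spark's implementation.
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
probs  = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]        # hypothetical class-1 probabilities
hard   = [1 if p >= 0.5 else 0 for p in probs]  # thresholded hard predictions

# probs preserve the full ranking; hard predictions introduce ties
# within each predicted class, so auc(hard, ...) <= auc(probs, ...)
```

With these made-up numbers, `auc(probs, labels)` is 8/9 while `auc(hard, labels)` drops to 6/9, which is the same kind of gap the SPARK-9528 fix removed.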

      Remaining issues

      For training set size 10M, Spark does not yet match the AUC of the other 2 libraries benchmarked (H2O and xgboost).

      Note that, on 1M instances, these 2 libraries also show better results than scikit-learn, VW, and R. I'm not too familiar with the H2O implementation and how it differs, but xgboost is a very different algorithm, so it's not surprising it has different behavior.

      My explorations

      I've run Spark on the test set of 10M instances. (Note that the benchmark linked above used somewhat different settings for the different algorithms, including gini vs. entropy impurity and limits on splitting nodes, but those settings are actually not that important for this problem.)

      I've tried adjusting:

      • maxDepth: Past depth 20, going deeper does not seem to matter.
      • maxBins: I've gone up to 500, but this too does not seem to matter. However, this is hard to verify, since slight differences in discretization could become significant in a large tree.
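To make the maxBins point concrete, here is a toy sketch (plain Python, illustrative only, not Spark's exact logic) of how quantile-style discretization of a continuous feature produces different candidate split thresholds at different bin counts, so individual split points can shift slightly between runs:

```python
# Toy equal-frequency binning: candidate split thresholds are taken at the
# quantile boundaries of the sorted feature values.
def split_candidates(values, max_bins):
    xs = sorted(values)
    step = len(xs) / max_bins
    # one boundary per interior bin edge (extremes excluded)
    return sorted({xs[int(i * step)] for i in range(1, max_bins)})

feature = [x * 0.1 for x in range(100)]   # made-up continuous feature

coarse = split_candidates(feature, 4)     # 3 candidate thresholds
fine = split_candidates(feature, 10)      # 9 candidate thresholds
# with more bins the candidate thresholds are denser; a split chosen deep in
# a large tree can therefore land on a slightly different value, which is why
# the effect of maxBins is hard to verify directly
```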

      Current questions

      • H2O: It would be good to understand how this implementation differs from standard RF implementations (in R, VW, scikit-learn, and Spark).
      • xgboost: There's a JIRA for it: SPARK-8547. It would be great to see the Spark package linked from that JIRA tested vs. MLlib on the benchmark data (or other data). From what I've heard/read, xgboost is sometimes better, sometimes worse in accuracy (but of course faster with more localized training).
      • Based on the above explorations, are there changes we should make to Spark RFs?

        Issue Links

          Activity

          yinxusen Xusen Yin added a comment - edited

          Joseph K. Bradley Rahul Tanwani Here is what I found:

          1. Dataset preprocessing
          In this dataset, all columns except DepTime and Distance are categorical features. The easiest way to transform the data into LabeledPoint form is RFormula. However, RFormula is not suitable here, because it produces a dataset with a different shape than the original: RFormula uses a one-hot encoder, so it expands the original dataset into thousands of columns.

          This brings two drawbacks:
          a. The volume of the dataset is expanded, which may hurt performance.
          b. The one-hot encoder splits one column into as many new columns as its cardinality, and since Random Forest cannot take groups of features into consideration, this may hurt accuracy.

          RFormula also treats DepTime and Distance as categorical features, which adds even more unnecessary columns and reduces the accuracy a step further, since DepTime and Distance are the two most important features for this task.

          By contrast, H2O uses the original dataset without further preprocessing.
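To illustrate the width blow-up, here is a small sketch (plain Python; the carrier values are hypothetical, and Spark's RFormula/OneHotEncoder actually emit sparse vectors rather than plain lists) of what one-hot encoding does to a categorical column:

```python
# One-hot encode a categorical column: one output column per distinct value.
def one_hot(column):
    categories = sorted(set(column))
    return [[1 if v == c else 0 for c in categories] for v in column]

# hypothetical UniqueCarrier-like column with cardinality 4
carriers = ["AA", "UA", "DL", "AA", "WN"]
encoded = one_hot(carriers)
# one input column became len(set(carriers)) == 4 columns; a column like
# Origin, with hundreds of airports, would expand into hundreds of columns,
# and the tree can no longer split on the original categorical as one feature
```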

          2. Spark RandomForest can also get a good result
          In my experiment, Spark RF with 10 trees, maxDepth 20, and 1M training instances gets AUC 0.744321364. With the same settings, H2O gets AUC 0.695598. For detailed results, see https://docs.google.com/document/d/1l7SGFtUkZeM4WEXFlpc08pfBfnu6d25KQFToFHC6CTo/edit?usp=sharing
          Note that the "NA"s mean Spark hit OOM on my laptop.

          3. OOM in Spark Random Forest
          In a single-machine environment, Spark RF is slower than H2O. Worse, OOM occurs frequently on Spark with larger bins, more trees, and larger maxDepth. The reason is that Spark allocates new Double arrays quite often inside each partition.

          Say in one partition of our dataset, Spark creates numNodes Double arrays, each of length numFeatures * numBins * statsSize. On a single machine with 16 partitions, we may allocate O(numPartitions * numNodes * numFeatures * numBins * statsSize) Doubles in total. From my experiments, the maxMemoryInMB parameter is barely useful here. It would be better to use multiple servers and spread those tasks out.
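A back-of-the-envelope version of that estimate (plain Python; the parameter values below are made-up illustrations, not measurements from the benchmark):

```python
# Rough bound on histogram memory across partitions, following the
# O(numPartitions * numNodes * numFeatures * numBins * statsSize) argument.
def histogram_doubles(num_partitions, num_nodes, num_features, num_bins, stats_size):
    return num_partitions * num_nodes * num_features * num_bins * stats_size

# hypothetical single-machine setting: 16 partitions, 128 frontier nodes in
# the current BFS layer, 8 features, 500 bins, 2 stats per bin (binary task)
doubles = histogram_doubles(16, 128, 8, 500, 2)
gib = doubles * 8 / 2**30   # 8 bytes per Double
# even at these made-up settings the layer's histograms alone take ~0.12 GiB,
# and the node count per layer grows quickly as the trees get deeper
```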

          Spark trains the random forest in a BFS manner, i.e. the 1st layer of all trees, then the 2nd layer of all trees, while H2O trains tree by tree and, inside each tree, layer by layer. H2O also uses smaller arrays than Spark to collect histograms. It uses Java Fork/Join to split tasks; inside each task it generates Double arrays of size numNodes * numFeatures * numBins, then merges them into a shared DHistogram in each process. (I am not entirely sure about this process, since the DRF code in H2O is more complicated than Spark's and lacks detailed comments.)
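The difference in training order can be sketched as two loop structures (plain Python pseudostructure; `train_layer` is a hypothetical placeholder for the per-layer histogram and split-finding work):

```python
# Spark-style: breadth-first across ALL trees at once. Each pass over the
# data serves every tree's current layer, but histograms for all trees'
# frontier nodes must be held in memory simultaneously.
def train_forest_bfs(num_trees, max_depth, train_layer):
    for depth in range(max_depth):
        # one pass handles layer `depth` of every tree
        train_layer([(t, depth) for t in range(num_trees)])

# H2O-style: tree by tree, and layer by layer inside each tree. More passes
# over the data, but only one tree's frontier histograms are alive at a time.
def train_forest_tree_by_tree(num_trees, max_depth, train_layer):
    for t in range(num_trees):
        for depth in range(max_depth):
            train_layer([(t, depth)])
```

This is why the BFS scheme trades memory (all trees' frontiers at once) for fewer passes over the data, consistent with the OOM behavior described above.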

          Besides, H2O has a MemoryManager to allocate arrays and stave off OOM as long as possible. Even so, H2O also crashed with OOM once on my laptop while training 500 trees with maxDepth 20 on the 10M dataset.

          tanwanirahul Rahul Tanwani added a comment -

          Xusen Yin Could you please share your findings? I'm interested to know how the H2O implementation is different from what we have in ml.

          Thanks!

          CodingCat Nan Zhu added a comment -

          FYI, we released a solution to integrate XGBoost with Spark directly

          http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html

          yinxusen Xusen Yin added a comment -

          I'd love to explore this.

          josephkb Joseph K. Bradley added a comment -

          Xusen Yin or holdenk Would either of you be interested in exploring this (or know others who would be)?


            People

            • Assignee:
              Unassigned
              Reporter:
              josephkb Joseph K. Bradley
            • Votes:
              2 Vote for this issue
              Watchers:
              17 Start watching this issue
