Spark / SPARK-13434

Reduce Spark RandomForest memory footprint

    Details

      Description

      The RandomForest implementation can easily run out of memory on moderately sized datasets. This was raised in a user's benchmarking project on GitHub (https://github.com/szilard/benchm-ml/issues/19). I looked for an existing tracking issue but couldn't find one.

      Using Spark 1.6, one of my users is running into problems training a RandomForest on largish datasets, on machines with 64G of memory and the following in spark-defaults.conf:

      spark.executor.cores 2
      spark.executor.instances 199
      spark.executor.memory 10240M
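To put the settings above in perspective, here is a rough sketch of the arithmetic they imply. The helper names (parse_conf, memory_mib) are made up for illustration, only the M and G suffixes are handled, and the figures cover executor heap only: any off-heap or container overhead is ignored, and nothing here says how executors are packed onto the 64G machines.

```python
def parse_conf(text):
    """Parse spark-defaults.conf style lines ("key value") into a dict."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, value = line.split(None, 1)
        conf[key] = value.strip()
    return conf


def memory_mib(value):
    """Convert '10240M' or '10G' to MiB (hypothetical helper, M/G only)."""
    units = {"m": 1, "g": 1024}
    return int(value[:-1]) * units[value[-1].lower()]


CONF = """\
spark.executor.cores 2
spark.executor.instances 199
spark.executor.memory 10240M
"""

conf = parse_conf(CONF)
cores = int(conf["spark.executor.cores"])
instances = int(conf["spark.executor.instances"])
heap_mib = memory_mib(conf["spark.executor.memory"])

total_heap_mib = instances * heap_mib   # heap summed over all executors
heap_per_core_mib = heap_mib // cores   # rough share per concurrent task
task_slots = instances * cores          # tasks that can run at once

if __name__ == "__main__":
    print(f"total executor heap: {total_heap_mib} MiB")
    print(f"heap per core:       {heap_per_core_mib} MiB")
    print(f"task slots:          {task_slots}")
```

The heap-per-core figure is only an upper bound on what a single task can use, since tasks in one executor share the same heap.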
      

      I reproduced the excessive memory use with the benchmark example (an input CSV of 1.3G with 686 columns) in the Spark shell, started with spark-shell --driver-memory 30G --executor-memory 30G, and collected heap profiles on a single machine by running jmap -histo:live <spark-pid>. I took a sample every 5 seconds; at the peak it looks like this:

       num     #instances         #bytes  class name
      ----------------------------------------------
         1:       5428073     8458773496  [D
         2:      12293653     4124641992  [I
         3:      32508964     1820501984  org.apache.spark.mllib.tree.model.Node
         4:      53068426     1698189632  org.apache.spark.mllib.tree.model.Predict
         5:      72853787     1165660592  scala.Some
         6:      16263408      910750848  org.apache.spark.mllib.tree.model.InformationGainStats
         7:         72969      390492744  [B
         8:       3327008      133080320  org.apache.spark.mllib.tree.impl.DTStatsAggregator
         9:       3754500      120144000  scala.collection.immutable.HashMap$HashMap1
        10:       3318349      106187168  org.apache.spark.mllib.tree.model.Split
        11:       3534946       84838704  org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo
        12:       3764745       60235920  java.lang.Integer
        13:       3327008       53232128  org.apache.spark.mllib.tree.impurity.EntropyAggregator
        14:        380804       45361144  [C
        15:        268887       34877128  <constMethodKlass>
        16:        268887       34431568  <methodKlass>
        17:        908377       34042760  [Lscala.collection.immutable.HashMap;
        18:       1100000       26400000  org.apache.spark.mllib.regression.LabeledPoint
        19:       1100000       26400000  org.apache.spark.mllib.linalg.SparseVector
        20:         20206       25979864  <constantPoolKlass>
        21:       1000000       24000000  org.apache.spark.mllib.tree.impl.TreePoint
        22:       1000000       24000000  org.apache.spark.mllib.tree.impl.BaggedPoint
        23:        908332       21799968  scala.collection.immutable.HashMap$HashTrieMap
        24:         20206       20158864  <instanceKlassKlass>
        25:         17023       14380352  <constantPoolCacheKlass>
        26:            16       13308288  [Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator;
        27:        445797       10699128  scala.Tuple2
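For anyone who wants to repeat the measurement, the sampling described above could be scripted roughly as follows. This is a minimal sketch, not the script used for this report: parse_histo, run_jmap and peak_histogram are illustrative names, and "peak" is assumed here to mean the sample with the largest total byte count. In the histogram, [D and [I are the JVM's names for double[] and int[]. Note that -histo:live forces a full GC on each call, so sampling perturbs the process being measured.

```python
import re
import subprocess
import time

# One histogram row: "  3:  32508964  1820501984  some.Class"
_ROW = re.compile(r"^\s*(\d+):\s+(\d+)\s+(\d+)\s+(\S+)")


def parse_histo(text):
    """Return (class_name, instances, size_bytes) tuples from jmap -histo output."""
    rows = []
    for line in text.splitlines():
        m = _ROW.match(line)
        if m:  # header, separator and total lines do not match
            rows.append((m.group(4), int(m.group(2)), int(m.group(3))))
    return rows


def run_jmap(pid):
    """Run `jmap -histo:live <pid>` and return its stdout."""
    result = subprocess.run(
        ["jmap", "-histo:live", str(pid)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def peak_histogram(pid, samples, interval_s=5.0, run=run_jmap, sleep=time.sleep):
    """Take `samples` histograms and return the one with the most total bytes."""
    best, best_total = [], -1
    for i in range(samples):
        rows = parse_histo(run(pid))
        total = sum(size for _, _, size in rows)
        if total > best_total:
            best, best_total = rows, total
        if i < samples - 1:
            sleep(interval_s)  # wait between samples, not after the last one
    return best


# First rows of the histogram from this report, for trying out the parser.
SAMPLE = """\
 num     #instances         #bytes  class name
----------------------------------------------
   1:       5428073     8458773496  [D
   2:      12293653     4124641992  [I
   3:      32508964     1820501984  org.apache.spark.mllib.tree.model.Node
   4:      53068426     1698189632  org.apache.spark.mllib.tree.model.Predict
"""

if __name__ == "__main__":
    for name, instances, size in parse_histo(SAMPLE):
        print(f"{name}: {size / instances:.0f} bytes/instance")
```

Dividing bytes by instances per class is a quick way to separate "many small objects" (the tree model classes) from "fewer large arrays" ([D, [I) when reading the table.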
      

        Attachments

        1. rf-heap-usage.png
          61 kB
          Ewan Higgs
        2. heap-usage.log
          612 kB
          Ewan Higgs

              People

              • Assignee: Unassigned
              • Reporter: Ewan Higgs (ehiggs)
              • Votes: 4
              • Watchers: 9
