Spark / SPARK-17801

[ML] Random Forest Regression fails for large input


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: 1.6.1
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None
    • Environment: Ubuntu 14.04

    Description

      Random Forest Regression
      Data: https://www.kaggle.com/c/grupo-bimbo-inventory-demand/download/train.csv.zip

      Parameters:
      NumTrees: 500
      Maximum Bins: 7477383
      MaxDepth: 27
      MinInstancesPerNode: 8648
      SamplingRate: 1.0
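A rough size estimate helps put the Maximum Bins value of 7477383 in perspective. The sketch below is a back-of-envelope calculation, not a measurement: it assumes that tree training keeps three double-precision statistics (count, sum, sum of squares) per bin per feature for each node being split, which is how variance-based regression is commonly aggregated. The feature count of 10 is a hypothetical placeholder, since the report does not state it, and real bin counts are capped by the number of distinct values, so the result is an upper bound only.

```python
# Back-of-envelope upper bound on per-node histogram size.
# Assumptions (not from the report): 3 doubles per bin for regression,
# 8 bytes per double, and a hypothetical feature count.

MAX_BINS = 7_477_383        # "Maximum Bins" from the report
STATS_PER_BIN = 3           # assumed: count, sum, sum of squares
BYTES_PER_DOUBLE = 8
NUM_FEATURES = 10           # hypothetical; the report does not say


def histogram_bytes(num_features, bins_per_feature=MAX_BINS):
    """Upper bound on the bytes needed for one node's bin statistics."""
    return num_features * bins_per_feature * STATS_PER_BIN * BYTES_PER_DOUBLE


per_feature = histogram_bytes(1)
per_node = histogram_bytes(NUM_FEATURES)

print(f"one feature, one node: {per_feature / 2**20:.1f} MiB")
print(f"{NUM_FEATURES} features, one node: {per_node / 2**30:.2f} GiB")
```

Under these assumptions the cost scales linearly with the bin count, so a much smaller Maximum Bins value would shrink the per-node statistics proportionally.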

      Java Options:
      "-Xms16384M"
      "-Xmx16384M"
      "-Dspark.locality.wait=0s"
      "-Dspark.driver.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ParallelGCThreads=2 -XX:-UseAdaptiveSizePolicy -XX:ConcGCThreads=2 -XX:-UseGCOverheadLimit -XX:CMSInitiatingOccupancyFraction=75 -XX:NewSize=8g -XX:MaxNewSize=8g -XX:SurvivorRatio=3 -DnumPartitions=36"
      "-Dspark.submit.deployMode=cluster"
      "-Dspark.speculation=true"
      "-Dspark.speculation.multiplier=2"
      "-Dspark.driver.memory=16g"
      "-Dspark.speculation.interval=300ms"
      "-Dspark.speculation.quantile=0.5"
      "-Dspark.akka.frameSize=768"
      "-Dspark.driver.supervise=false"
      "-Dspark.executor.cores=6"
      "-Dspark.executor.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:-UseAdaptiveSizePolicy -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=6 -XX:NewSize=22g -XX:MaxNewSize=22g -XX:SurvivorRatio=2 -XX:+PrintAdaptiveSizePolicy -XX:+PrintGCDateStamps"
      "-Dspark.rpc.askTimeout=10"
      "-Dspark.executor.memory=40g"
      "-Dspark.driver.maxResultSize=3g"
      "-Xss10240k"
      "-XX:+PrintGCDetails"
      "-XX:+PrintGCTimeStamps"
      "-XX:+PrintTenuringDistribution"
      "-XX:+UseConcMarkSweepGC"
      "-XX:+UseParNewGC"
      "-XX:ParallelGCThreads=2"
      "-XX:-UseAdaptiveSizePolicy"
      "-XX:ConcGCThreads=2"
      "-XX:-UseGCOverheadLimit"
      "-XX:CMSInitiatingOccupancyFraction=75"
      "-XX:NewSize=8g"
      "-XX:MaxNewSize=8g"
      "-XX:SurvivorRatio=3"
      "-DnumPartitions=36"

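For reference, the GC flags in the Java options imply a particular split of each heap. The sketch below works that split out under the usual HotSpot convention that SurvivorRatio is the ratio of eden to one survivor space and that the young generation holds eden plus two survivor spaces. The function name and the use of whole gigabytes are illustrative choices, and the JVM may round or align the real sizes, so treat the numbers as approximate.

```python
# Approximate heap layout implied by -Xmx / -XX:NewSize / -XX:SurvivorRatio.
# Assumes the conventional rule: young generation = eden + 2 survivor spaces,
# with SurvivorRatio = eden size / one survivor space size.

def heap_layout(heap_gb, new_gb, survivor_ratio):
    """Return approximate eden, survivor, and old-generation sizes in GB."""
    survivor = new_gb / (survivor_ratio + 2)
    eden = survivor * survivor_ratio
    old = heap_gb - new_gb
    return {"eden_gb": eden, "survivor_gb": survivor, "old_gb": old}


# Driver: -Xmx16384M, -XX:NewSize=8g, -XX:SurvivorRatio=3
driver = heap_layout(heap_gb=16, new_gb=8, survivor_ratio=3)

# Executor: spark.executor.memory=40g, -XX:NewSize=22g, -XX:SurvivorRatio=2
executor = heap_layout(heap_gb=40, new_gb=22, survivor_ratio=2)

for name, layout in (("driver", driver), ("executor", executor)):
    print(name, {key: round(value, 2) for key, value in layout.items()})
```

Under this reading, more than half of each executor heap is reserved for the young generation, which leaves the old generation with the remainder for long-lived data.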
      Partial Driver StackTrace:
      org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:740)
      org.apache.spark.ml.tree.impl.RandomForest$.findBestSplits(RandomForest.scala:525)
      org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:160)
      org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:209)
      org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:197)
      org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
      org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
      org.apache.spark.ml.Estimator.fit(Estimator.scala:59)
      org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78)
      org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78)

      The complete executor and driver error logs are available at:
      https://gist.github.com/anonymous/603ac7f8f17e43c51ba93b2934cd4cb6

          People

            Assignee: Unassigned
            Reporter: samkit
            Votes: 0
            Watchers: 4
