Spark / SPARK-10629

Gradient boosted trees: mapPartitions input size increasing

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.4.1
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels: None

    Description

      First of all, I think my problem is quite different from https://issues.apache.org/jira/browse/SPARK-10433, which reports that the input size increases at each iteration.

      My problem is that the mapPartitions input size increases within a single iteration. My training samples have 2958359 features in total. Within one iteration, collectAsMap was called 3 times. Here is a summary of each call.

      | Stage Id | Description | Duration | Input | Shuffle Read | Shuffle Write |
      | :------: | :---------: | :------: | :---: | :----------: | :-----------: |
      | 4 | mapPartitions at DecisionTree.scala:613 | 1.6 h | 710.2 MB | | 2.8 GB |
      | 5 | collectAsMap at DecisionTree.scala:642 | 1.8 min | | 2.8 GB | |
      | 6 | mapPartitions at DecisionTree.scala:613 | 1.2 h | 27.0 GB | | 5.6 GB |
      | 7 | collectAsMap at DecisionTree.scala:642 | 2.0 min | | 5.6 GB | |
      | 8 | mapPartitions at DecisionTree.scala:613 | 1.2 h | 26.5 GB | | 11.1 GB |
      | 9 | collectAsMap at DecisionTree.scala:642 | 2.0 min | | 8.3 GB | |
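
      As a rough check of the figures in the table, the following minimal sketch computes how much the mapPartitions input and shuffle write sizes grow from one stage to the next. The numbers are copied from the table; the `growth` helper and the conversion of 1 GB = 1024 MB are assumptions made only for this illustration.

```python
# Figures for the three mapPartitions stages, copied from the table.
# Assumption: 1 GB = 1024 MB when converting the stage 4 input size.
stages = [
    {"stage": 4, "input_gb": 710.2 / 1024, "shuffle_write_gb": 2.8},
    {"stage": 6, "input_gb": 27.0, "shuffle_write_gb": 5.6},
    {"stage": 8, "input_gb": 26.5, "shuffle_write_gb": 11.1},
]


def growth(values):
    """Return the ratio of each value to the value before it (hypothetical helper)."""
    return [later / earlier for earlier, later in zip(values, values[1:])]


input_growth = growth([s["input_gb"] for s in stages])
write_growth = growth([s["shuffle_write_gb"] for s in stages])

# Print the growth factor of each stage relative to the previous mapPartitions stage.
for s, g_in, g_wr in zip(stages[1:], input_growth, write_growth):
    print(f"stage {s['stage']}: input x{g_in:.1f}, shuffle write x{g_wr:.1f}")
```

      Under these assumptions, the input size jumps by a factor of roughly 39 between stage 4 and stage 6 and then stays about the same, while the shuffle write size approximately doubles at each step.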

      The mapPartitions stages take far too long, which seems very strange. Could there be a bug?

            People

              Assignee: Unassigned
              Reporter: Wenmin Wu
              Votes: 0
              Watchers: 2
