  1. Spark
  2. SPARK-21624

Optimize communication cost of RF/GBT/DT

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: ML, MLlib

    Description

      The RF implementation is bound either by the cost of computing statistics on the workers or by the cost of communicating the sufficient statistics.

      The statistics are stored in allStats:

        /**
         * Flat array of elements.
         * Index for start of stats for a (feature, bin) is:
         *   index = featureOffsets(featureIndex) + binIndex * statsSize
         */
        private var allStats: Array[Double] = new Array[Double](allStatsSize)
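
The indexing rule in the comment above can be illustrated with a small, self-contained sketch. The bin counts and `statsSize` below are hypothetical values chosen only for illustration, and `statsIndex` is a helper name introduced here, not something taken from the Spark source.

```scala
object FlatStatsLayout {
  // Assumed values for illustration only.
  val statsSize = 3              // e.g. one counter per class, for 3 classes
  val numBins = Array(4, 2, 5)   // number of bins for each of 3 features

  // featureOffsets(i) is where feature i's block starts in the flat array;
  // the final entry is the total length of the array.
  val featureOffsets: Array[Int] =
    numBins.scanLeft(0)((total, nBins) => total + nBins * statsSize)

  val allStatsSize: Int = featureOffsets.last
  val allStats: Array[Double] = new Array[Double](allStatsSize)

  // Start of the stats slot for a (feature, bin) pair, as in the comment above.
  def statsIndex(featureIndex: Int, binIndex: Int): Int =
    featureOffsets(featureIndex) + binIndex * statsSize

  def main(args: Array[String]): Unit = {
    val start = statsIndex(featureIndex = 1, binIndex = 1)
    allStats(start + 2) += 1.0   // bump the third statistic of that bin
    println(s"offsets=${featureOffsets.mkString(",")} start=$start size=$allStatsSize")
  }
}
```

Every (feature, bin) pair reserves `statsSize` slots whether or not any sample falls into that bin, which is why the array can be mostly zeros.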
      

      allStats may be very large, and it can be very sparse, especially for nodes near the leaves of the tree.

      I have changed allStats from Array to SparseVector; in my tests, communication is down by about 50%.
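
To make the dense-versus-sparse trade-off concrete, here is a minimal sketch in plain Scala. `SparseStats` is a hypothetical stand-in for a sparse vector (size, indices, values); it is not Spark's `SparseVector` and not the patch described in this issue. The byte counts are a rough estimate that assumes 8 bytes per Double and 4 bytes per Int index and ignores serialization overhead.

```scala
object SparseStatsSketch {
  // Hypothetical sparse container: only non-zero entries are stored.
  final case class SparseStats(size: Int, indices: Array[Int], values: Array[Double])

  // Keep the positions and values of the non-zero entries.
  def toSparse(dense: Array[Double]): SparseStats = {
    val nonZero = dense.indices.filter(i => dense(i) != 0.0).toArray
    SparseStats(dense.length, nonZero, nonZero.map(i => dense(i)))
  }

  // Rebuild the flat array so downstream code can keep indexing it directly.
  def toDense(sparse: SparseStats): Array[Double] = {
    val out = new Array[Double](sparse.size)
    var k = 0
    while (k < sparse.indices.length) {
      out(sparse.indices(k)) = sparse.values(k)
      k += 1
    }
    out
  }

  // Rough payload estimates in bytes (assumption: no serialization overhead).
  def denseBytes(dense: Array[Double]): Long = 8L * dense.length
  def sparseBytes(sparse: SparseStats): Long = 4L + 12L * sparse.indices.length

  def main(args: Array[String]): Unit = {
    val dense = Array(0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 1.0, 0.0)
    val sparse = toSparse(dense)
    println(s"dense=${denseBytes(dense)} bytes, sparse=${sparseBytes(sparse)} bytes")
  }
}
```

Under this estimate each stored entry costs 12 bytes instead of 8, so the sparse form is only smaller when fewer than roughly two thirds of the entries are non-zero. The actual saving depends on how sparse the statistics are for a given node and dataset.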

          People

            Assignee: Unassigned
            Reporter: Peng Meng (peng.meng@intel.com)
            Votes: 0
            Watchers: 4
