Details
- Improvement
- Status: Resolved
- Major
- Resolution: Incomplete
- 2.3.0
- None
Description
The random forest (RF) implementation is bound either by the cost of computing statistics on the workers or by the cost of communicating the sufficient statistics.
The statistics are stored in allStats:
/**
 * Flat array of elements.
 * Index for start of stats for a (feature, bin) is:
 *   index = featureOffsets(featureIndex) + binIndex * statsSize
 */
private var allStats: Array[Double] = new Array[Double](allStatsSize)
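To make the layout concrete, here is a minimal sketch of the index arithmetic from the comment above. It assumes that featureOffsets is the running sum of numBins * statsSize over the features. The object name FlatStatsIndexSketch, its method signatures, and the example numbers are hypothetical and are not taken from the actual implementation.

```scala
object FlatStatsIndexSketch {
  // Assumption: feature f owns numBins(f) * statsSize consecutive slots,
  // so the offsets are a running sum. The last entry is the total array size.
  def featureOffsets(numBins: Array[Int], statsSize: Int): Array[Int] =
    numBins.scanLeft(0)((total, n) => total + n * statsSize)

  // Start index of the stats for (featureIndex, binIndex), as in the comment:
  // index = featureOffsets(featureIndex) + binIndex * statsSize
  def statsIndex(offsets: Array[Int], featureIndex: Int, binIndex: Int, statsSize: Int): Int =
    offsets(featureIndex) + binIndex * statsSize
}
```

For example, with numBins = Array(3, 2) and statsSize = 2, the offsets are 0, 6, 10, so the stats for (feature 1, bin 1) start at index 8 and the flat array has 10 entries.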
The size of allStats may be very large, and the array can be very sparse, especially for nodes near the leaves of the tree.
I have changed allStats from an Array to a SparseVector; in my tests, communication is reduced by about 50%.
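The idea can be illustrated with a minimal, self-contained sketch. This is not the actual patch and does not use Spark's SparseVector class. SparseStats, toSparse, toDense, denseBytes, and sparseBytes are hypothetical names, and the byte counts assume 8 bytes per Double and 4 bytes per Int index, ignoring any serialization overhead.

```scala
object SparseStatsSketch {
  // Hypothetical sparse encoding: logical size, indices of non-zero entries, their values.
  final case class SparseStats(size: Int, indices: Array[Int], values: Array[Double])

  // Keep only the non-zero entries of the flat stats array.
  def toSparse(dense: Array[Double]): SparseStats = {
    val nz = dense.indices.filter(i => dense(i) != 0.0).toArray
    SparseStats(dense.length, nz, nz.map(i => dense(i)))
  }

  // Rebuild the flat array, e.g. on the receiving side.
  def toDense(s: SparseStats): Array[Double] = {
    val out = new Array[Double](s.size)
    var k = 0
    while (k < s.indices.length) {
      out(s.indices(k)) = s.values(k)
      k += 1
    }
    out
  }

  // Rough payload estimates (assumption: no headers or compression).
  def denseBytes(dense: Array[Double]): Long = 8L * dense.length
  def sparseBytes(s: SparseStats): Long = (4L + 8L) * s.indices.length
}
```

Under these assumptions the sparse form is smaller whenever fewer than two thirds of the entries are non-zero. For instance, an 8-element array with 2 non-zero entries takes 64 bytes dense and 24 bytes sparse.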