Description
When working with large feature spaces and high bin counts (>500), the binning process can take many hours. This is particularly painful because the executors stay allocated for the whole duration while the driver does the binning, which is not friendly to shared clusters.
The binning process can and should be performed on the executors instead of the driver.
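To illustrate the idea, here is a minimal sketch in plain Python of how split points could be computed from per-partition summaries. It is not Spark's implementation and does not reflect what the linked issues actually do. Lists stand in for partitions, and every name (`partition_histogram`, `merge_histograms`, `split_points`, `distributed_splits`, the `buckets` parameter) is hypothetical. The assumption is that each executor builds a small fixed-width histogram of its own data, so only the compact histograms, not the raw values, reach the driver.

```python
from functools import reduce


def partition_histogram(values, lo, hi, buckets):
    """Executor-side step: count one partition's values into fixed-width buckets."""
    counts = [0] * buckets
    width = (hi - lo) / buckets
    for v in values:
        # Clamp so that v == hi falls into the last bucket; a zero width
        # (all values equal) puts everything into bucket 0.
        idx = min(int((v - lo) / width), buckets - 1) if width > 0 else 0
        counts[idx] += 1
    return counts


def merge_histograms(a, b):
    """Reduce step: histograms over the same bucket edges add element-wise."""
    return [x + y for x, y in zip(a, b)]


def split_points(counts, lo, hi, num_bins):
    """Driver-side step: approximate equal-frequency splits from the merged histogram."""
    total = sum(counts)
    width = (hi - lo) / len(counts)
    splits, cumulative, target = [], 0, 1
    for i, c in enumerate(counts):
        cumulative += c
        # One bucket may cross several quantile targets; emit its edge once.
        while target < num_bins and cumulative >= target * total / num_bins:
            edge = lo + (i + 1) * width
            if not splits or splits[-1] != edge:
                splits.append(edge)
            target += 1
    return splits


def distributed_splits(partitions, num_bins, buckets=1000):
    """Simulate the two distributed passes over a list of partitions."""
    non_empty = [p for p in partitions if p]
    # Pass 1: each partition reports only its min and max.
    lo = min(min(p) for p in non_empty)
    hi = max(max(p) for p in non_empty)
    # Pass 2: each partition reports a histogram; the histograms are merged.
    hists = [partition_histogram(p, lo, hi, buckets) for p in non_empty]
    merged = reduce(merge_histograms, hists)
    return split_points(merged, lo, hi, num_bins)


partitions = [list(range(1, 51)), list(range(51, 101))]
splits = distributed_splits(partitions, num_bins=4, buckets=99)
```

In this sketch the accuracy of the splits is bounded by the bucket width, so `buckets` should be considerably larger than `num_bins`. The data sent to the driver is `buckets` counters per partition, regardless of how many rows each partition holds.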
Issue Links
- relates to
  - SPARK-12182 Distributed binning for trees in spark.ml (Resolved)
  - SPARK-10785 Scale QuantileDiscretizer using distributed binning (Closed)