[SPARK-10064] Decision tree continuous feature binning is slow in large feature spaces - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.4.1
Fix Version/s: 1.6.0
Component/s: MLlib
Labels: None

Description

With a large feature space and a high bin count (more than 500), binning continuous features can take many hours. This is particularly painful because it ties up the executors for the duration, which is unfriendly to shared clusters.

The binning process can and should be performed on the executors instead of the driver.
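
A minimal sketch of why this work parallelizes, written in plain Python rather than against the Spark API: each continuous feature's split thresholds depend only on that feature's sampled values, so the samples can be keyed by feature index and each group binned independently. The names `find_splits` and `splits_by_feature` and the threshold rule are hypothetical illustrations, not the code from the linked pull request.

```python
from collections import defaultdict


def find_splits(values, num_splits):
    """Pick up to num_splits thresholds for one continuous feature.

    Hypothetical rule for illustration: midpoints when there are few
    distinct values, otherwise evenly spaced order statistics.
    """
    distinct = sorted(set(values))
    # Few distinct values: split midway between each adjacent pair.
    if len(distinct) - 1 <= num_splits:
        return [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]
    # Many distinct values: take evenly spaced ranks as thresholds.
    step = len(distinct) / (num_splits + 1)
    return [distinct[int(step * (i + 1))] for i in range(num_splits)]


def splits_by_feature(rows, num_splits):
    """Group sampled values by feature index, then bin each group on its own."""
    by_feature = defaultdict(list)
    # Emit (feature_index, value) pairs and group them by index.
    for row in rows:
        for idx, value in enumerate(row):
            by_feature[idx].append(value)
    # Each feature is handled independently of the others, so on a cluster
    # this step could run on the executors instead of in one driver loop.
    return {idx: find_splits(vals, num_splits) for idx, vals in by_feature.items()}
```

For example, `splits_by_feature([[1.0, 10.0], [2.0, 10.0], [3.0, 20.0]], 5)` yields the thresholds `[1.5, 2.5]` for feature 0 and `[15.0]` for feature 1.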

Issue Links

relates to

SPARK-12182 Distributed binning for trees in spark.ml (Resolved)

SPARK-10785 Scale QuantileDiscretizer using distributed binning (Closed)

links to

[Github] Pull Request #8246 (NathanHowell)

People

Assignee: Nathan Howell

Reporter: Nathan Howell

Shepherd: Joseph K. Bradley

Votes: 0

Watchers: 3

Dates

Created: 17/Aug/15 18:44

Updated: 07/Dec/15 19:39

Resolved: 08/Oct/15 00:46