[SPARK-3043] DecisionTree aggregation is inefficient - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.1.0
Fix Version/s: 1.2.0
Component/s: MLlib
Labels: None
Target Version/s: 1.2.0

Description

Two major efficiency issues in computation and storage:

(1) DecisionTree aggregation involves reshaping data unnecessarily.

For example, the internal methods extractNodeInfo() and getBinDataForNode() reshape the data multiple times without doing any real computation.

(2) DecisionTree splits and aggregate bins can include many unused bins/splits.

The same number of splits/bins is used for every feature. For example, if a continuous feature uses 100 bins, then 100 bins are also allocated for each binary feature, even though only 2 are needed.
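To make the two issues concrete, here is a minimal Scala sketch of a per-feature bin layout. It is an illustration only, not the code from the linked pull request: the object `BinLayout` and its methods `uniformBinCount`, `featureOffsets`, and `accumulate` are hypothetical names, and the aggregated statistic is assumed to be a plain sum of labels.

```scala
// Hypothetical sketch: none of these names are part of MLlib's API.
object BinLayout {

  /** Bins allocated when every feature gets as many bins as the largest feature. */
  def uniformBinCount(binsPerFeature: Array[Int]): Int =
    if (binsPerFeature.isEmpty) 0 else binsPerFeature.length * binsPerFeature.max

  /** Start offset of each feature in a flat array sized per feature.
    * The last element is the total number of bins actually needed. */
  def featureOffsets(binsPerFeature: Array[Int]): Array[Int] =
    binsPerFeature.scanLeft(0)(_ + _)

  /** Adds one point's label to its bin for every feature, writing straight
    * into the flat array so no intermediate reshaping is required.
    * bins(f) is the (assumed precomputed) bin index of the point for feature f. */
  def accumulate(agg: Array[Double], offsets: Array[Int],
                 bins: Array[Int], label: Double): Unit = {
    var f = 0
    while (f < bins.length) {
      agg(offsets(f) + bins(f)) += label
      f += 1
    }
  }
}
```

For one continuous feature with 100 bins and two binary features, `uniformBinCount(Array(100, 2, 2))` gives 300 bins, whereas the last entry of `featureOffsets(Array(100, 2, 2))` is 104. Indexing the flat array through the offsets also lets statistics be read and written in place, which is one way to avoid the repeated reshaping described in (1).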

Issue Links

contains: SPARK-3157 Avoid duplicated stats in DecisionTree extractLeftRightNodeAggregates (Closed)

links to: [GitHub] Pull Request #2125 (jkbradley)

People

Assignee: Joseph K. Bradley
Reporter: Joseph K. Bradley
Votes: 0
Watchers: 4

Dates

Created: 14/Aug/14 18:21
Updated: 21/Jul/15 17:43
Resolved: 08/Sep/14 16:48