Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Incomplete
- Affects Version/s: 1.1.0
- Fix Version/s: None
Description
Storage and communication optimization:
DecisionTree aggregate statistics could store less data (described below). The savings would be significant for datasets with many low-arity categorical features (binary features, or unordered categorical features); savings would be negligible for continuous features.
DecisionTree stores a vector of sufficient statistics for each (node, feature, bin) triple. We could store one fewer bin per (node, feature) pair: for a given (node, feature), if we store these vectors for all but the last bin and also store the total statistics for each node, then the statistics for the last bin can be computed by subtraction. For binary and unordered categorical features, this would halve the number of bins to store and communicate.
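The subtraction trick above can be sketched as follows. This is a minimal illustration, not Spark's actual aggregator code; the function name and the list-of-lists representation of per-bin statistics vectors are assumptions for the example.

```python
# Hypothetical sketch of the proposed scheme: for each (node, feature),
# store the sufficient-statistics vectors for bins 0..B-2 plus the node
# total, then reconstruct the last bin's vector by subtraction.

def reconstruct_last_bin(stored_bins, node_total):
    """stored_bins: stat vectors for all bins except the last.
    node_total: stat vector summed over all bins for this node.
    Returns the stat vector for the last bin."""
    last = list(node_total)
    for stats in stored_bins:
        for i, v in enumerate(stats):
            last[i] -= v
    return last

# Example: a binary feature (2 bins), statistics = per-class label counts.
all_bins = [[3.0, 1.0], [2.0, 4.0]]   # full per-bin statistics
node_total = [5.0, 5.0]               # sum over both bins (stored per node)
stored = all_bins[:-1]                # only the first bin is stored/communicated

print(reconstruct_last_bin(stored, node_total))  # -> [2.0, 4.0]
```

For a binary feature this stores one bin vector instead of two, which is the claimed 2x reduction; the node total is shared across all features, so its cost is amortized.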
Attachments
Issue Links
- is contained by:
  - SPARK-14045 DecisionTree improvement umbrella (Resolved)
- is related to:
  - SPARK-14351 Optimize ImpurityAggregator for decision trees (Resolved)
  - SPARK-22451 Reduce decision tree aggregate size for unordered features from O(2^numCategories) to O(numCategories) (Resolved)
- relates to:
  - SPARK-10788 Decision Tree duplicates bins for unordered categorical features (Resolved)