[SPARK-22451] Reduce decision tree aggregate size for unordered features from O(2^numCategories) to O(numCategories) - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.2.0
Fix Version/s: None
Component/s: ML
Labels:
- bulk-closed

Description

There is no need to generate all possible splits for unordered features before aggregation. Instead:
1. In the aggregation (executor side), change `mixedBinSeqOp` so that each unordered feature collects the same statistics as an ordered feature. An unordered feature then needs only O(numCategories) space for its statistics.
2. On the driver side, once the aggregate result is available, generate all possible split combinations and compute the best split.
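
The two steps above can be illustrated with a minimal sketch. This is plain Python rather than Spark code. The function names (`aggregate_category_stats`, `best_unordered_split`), the use of class counts as the statistic, and Gini impurity as the criterion are all assumptions made for illustration. They are not taken from the pull request.

```python
from itertools import combinations


def aggregate_category_stats(points, num_categories, num_classes):
    """Executor side (hypothetical): one counter update per point.

    `points` is an iterable of (category, label) pairs for one unordered feature.
    The result holds num_categories * num_classes counters.
    """
    stats = [[0] * num_classes for _ in range(num_categories)]
    for category, label in points:
        stats[category][label] += 1
    return stats


def gini(counts):
    """Gini impurity of a class-count vector (0.0 for an empty vector)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def best_unordered_split(stats):
    """Driver side (hypothetical): enumerate category subsets from per-category stats.

    Returns (weighted_impurity, left_categories) for the best binary partition.
    """
    num_categories = len(stats)
    num_classes = len(stats[0])
    total = [sum(row[k] for row in stats) for k in range(num_classes)]
    n = sum(total)
    if n == 0:
        return (0.0, None)
    best = (float("inf"), None)
    # Category 0 always stays on the right, so each of the
    # 2^(numCategories - 1) - 1 partitions is visited exactly once.
    rest = range(1, num_categories)
    for size in range(1, num_categories):
        for left in combinations(rest, size):
            left_counts = [sum(stats[c][k] for c in left) for k in range(num_classes)]
            right_counts = [t - l for t, l in zip(total, left_counts)]
            n_left = sum(left_counts)
            weighted = (n_left * gini(left_counts) + (n - n_left) * gini(right_counts)) / n
            if weighted < best[0]:
                best = (weighted, set(left))
    return best


points = [(0, 0), (0, 0), (1, 1), (1, 1), (2, 0), (2, 0)]
stats = aggregate_category_stats(points, num_categories=3, num_classes=2)
print(best_unordered_split(stats))
```

In this sketch the executor never sees the candidate splits. Only the driver enumerates subsets, and it does so over the small per-category table rather than over the data points.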

This reduces the decision tree aggregate size for each unordered feature from O(2^numCategories) to O(numCategories), where `numCategories` is the arity of the unordered feature.

This also reduces the CPU cost on the executor side: the time complexity for an unordered feature drops from O(numPoints * 2^numCategories) to O(numPoints).
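
To give a rough sense of scale, the sketch below compares the two aggregate layouts for a single unordered feature. It assumes one statistics block of `stats_size` values per candidate split in the per-split layout and one block per category in the per-category layout. The exact bin layout in Spark may differ, and the helper names are hypothetical.

```python
def num_unordered_splits(num_categories):
    """Distinct ways to partition the categories into two non-empty groups."""
    return 2 ** (num_categories - 1) - 1


def aggregate_slots(num_categories, stats_size, per_category):
    """Number of values held in the aggregate for one unordered feature.

    per_category=True  -> one stats block per category (proposed layout)
    per_category=False -> one stats block per candidate split (assumed old layout)
    """
    if per_category:
        return num_categories * stats_size
    return num_unordered_splits(num_categories) * stats_size


for k in (4, 8, 10):
    print(k, aggregate_slots(k, 3, per_category=False), aggregate_slots(k, 3, per_category=True))
```

Under these assumptions, a feature with 10 categories and 3 values per statistics block needs 1533 values in the per-split layout and 30 in the per-category layout.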

It does not increase the time complexity of computing the best split for unordered features on the driver side.

Issue Links

Is contained by:
- SPARK-14045 DecisionTree improvement umbrella (Resolved)

Relates to:
- SPARK-3383 DecisionTree aggregate size could be smaller (Resolved)

Links to:
- [Github] Pull Request #19666 (WeichenXu123)

People

Assignee: Unassigned

Reporter: Weichen Xu

Votes: 0

Watchers: 3

Dates

Created: 06/Nov/17 10:10

Updated: 21/May/19 04:11

Resolved: 21/May/19 04:11

Time Tracking

Estimated: 24h

Remaining: 24h

Logged: Not Specified