Spark / SPARK-18946

treeAggregate is inefficient when aggregating high-dimensional vectors in ML algorithms


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ML, MLlib

    Description

      In many machine learning algorithms, we have to treeAggregate large vectors/arrays because of the large number of features. Unfortunately, RDD's treeAggregate operation becomes inefficient once the dimension of these vectors/arrays exceeds a million: such a vector/array typically occupies more than 100 MB of memory, and transferring 100 MB elements among executors is very inefficient in Spark.
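      To illustrate the communication pattern the issue is about, here is a pure-Python sketch of the multi-level combine that `RDD.treeAggregate` performs (this is an illustration of the pattern only, not Spark's actual implementation; the `tree_aggregate` function and its sample data are hypothetical). Each round pairwise-merges partial results, so with large gradient vectors every merge still ships the full, e.g. 100 MB, array between workers:

      ```python
      def tree_aggregate(partitions, seq_op, comb_op, zero, depth=2):
          """Simulate treeAggregate: per-partition seqOp, then a
          multi-level pairwise combOp before the final driver merge."""
          # Per-partition aggregation (the seqOp pass).
          results = []
          for part in partitions:
              acc = list(zero)
              for record in part:
                  acc = seq_op(acc, record)
              results.append(acc)
          # Multi-level pairwise combine (the tree reduction). Each round
          # halves the number of partial results, so the driver only merges
          # a handful of them -- but every merge still moves a whole vector.
          for _ in range(depth):
              if len(results) <= 1:
                  break
              merged = []
              for i in range(0, len(results), 2):
                  pair = results[i:i + 2]
                  merged.append(pair[0] if len(pair) == 1
                                else comb_op(pair[0], pair[1]))
              results = merged
          # Final merge at the driver.
          final = list(zero)
          for r in results:
              final = comb_op(final, r)
          return final

      # Example: element-wise sum of small "gradient" vectors across
      # three partitions (real ML workloads use million-dimensional ones).
      data = [[[1, 0, 0, 0], [0, 1, 0, 0]],
              [[0, 0, 1, 0]],
              [[0, 0, 0, 1], [1, 1, 1, 1]]]
      add = lambda a, b: [x + y for x, y in zip(a, b)]
      print(tree_aggregate(data, add, add, [0, 0, 0, 0], depth=2))
      # -> [2, 2, 2, 2]
      ```

      The tree shape reduces how many partial results the driver must merge, but it does not shrink the payload of each merge; that per-element size is what dominates when the vectors are very high-dimensional.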

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: zunwen you
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved: