[SPARK-31976] use MemoryUsage to control the size of block - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Resolved
Affects Version/s: 3.1.0
Fix Version/s: None
Component/s: ML, PySpark
Labels: None
Target Version/s: 3.2.0

Description

According to the performance tests in https://issues.apache.org/jira/browse/SPARK-31783, the performance gain depends mainly on the number of non-zero entries (nnz) in a block.

So it may be reasonable to control the size of a block by memory usage instead of by number of rows.
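To illustrate why a row count is a poor proxy for block size, here is a rough back-of-the-envelope sketch. It assumes a sparse vector costs about 12 bytes per non-zero entry (a 4-byte int index plus an 8-byte double value) and a dense vector costs 8 bytes per entry, and it ignores object and array overhead. The function names and the sample numbers are hypothetical, chosen only for illustration.

```python
# Rough per-vector memory estimates (assumption: overheads ignored).
BYTES_PER_DOUBLE = 8
BYTES_PER_INT = 4


def dense_bytes(size: int) -> int:
    """Approximate bytes for a dense vector with `size` entries."""
    return BYTES_PER_DOUBLE * size


def sparse_bytes(nnz: int) -> int:
    """Approximate bytes for a sparse vector with `nnz` non-zero entries."""
    return (BYTES_PER_INT + BYTES_PER_DOUBLE) * nnz


# Hypothetical dataset: 10,000 features, blocks of 1,024 rows.
num_features = 10_000
rows_in_block = 1024

# The same number of rows gives very different block sizes.
dense_block = rows_in_block * dense_bytes(num_features)
sparse_block = rows_in_block * sparse_bytes(10)  # 10 non-zeros per row

print(dense_block, sparse_block)
```

Under these assumptions the same row count yields blocks whose sizes differ by several orders of magnitude, which is the motivation for bounding blocks by memory instead.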

Note 1: the param blockSize is already used in ALS and MLP to stack vectors (which are expected to be dense).

Note 2: we may refer to Strategy.maxMemoryInMB in the tree models.

There may be two ways to implement this:

1. Compute the sparsity of the input vectors before training (this can be done together with the other statistics computation, possibly without an extra pass), and infer a reasonable number of vectors to stack.

2. Stack the input vectors adaptively by monitoring the memory usage of a block.
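The two options could look roughly like the following minimal sketch. It is not the code from the linked pull request. It assumes rows are sparse vectors represented as (indices, values) pairs, that each non-zero entry costs about 12 bytes, and that a byte budget `max_block_bytes` is given; all names here are hypothetical.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

# Hypothetical row representation: (indices, values) of a sparse vector.
SparseRow = Tuple[Sequence[int], Sequence[float]]

BYTES_PER_NNZ = 12  # assumption: 4-byte index + 8-byte value, no overhead


def row_bytes(row: SparseRow) -> int:
    """Approximate memory used by one sparse row."""
    return BYTES_PER_NNZ * len(row[1])


def rows_per_block(avg_nnz: float, max_block_bytes: int) -> int:
    """Option 1: infer a fixed row count from precomputed average sparsity."""
    per_row = max(BYTES_PER_NNZ * avg_nnz, 1.0)  # avoid division by zero
    return max(1, int(max_block_bytes // per_row))


def stack_adaptively(
    rows: Iterable[SparseRow], max_block_bytes: int
) -> Iterator[List[SparseRow]]:
    """Option 2: close a block once adding the next row would exceed the budget."""
    block: List[SparseRow] = []
    used = 0
    for row in rows:
        size = row_bytes(row)
        if block and used + size > max_block_bytes:
            yield block
            block, used = [], 0
        # A single oversized row still forms its own block.
        block.append(row)
        used += size
    if block:
        yield block
```

Option 1 needs the average nnz up front but keeps blocks uniform in row count; option 2 needs no statistics but produces blocks with varying numbers of rows.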

Issue Links

links to

[Github] Pull Request #28974 (zhengruifeng)

People

Assignee: Ruifeng Zheng

Reporter: Ruifeng Zheng

Votes: 0

Watchers: 6

Dates

Created: 12/Jun/20 07:19

Updated: 19/Mar/21 01:16

Resolved: 15/Dec/20 08:56