[SPARK-32061] potential regression if use memoryUsage instead of numRows - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Resolved
Affects Version/s: 3.1.0
Fix Version/s: None
Component/s: ML, PySpark
Labels: None

Description

1, if `memoryUsage` is set improperly, for example too small to store an instance;

2, blockify+GMM reuses two matrices whose shapes depend on the current blockSize:

@transient private lazy val auxiliaryProbMat = DenseMatrix.zeros(blockSize, k)
@transient private lazy val auxiliaryPDFMat = DenseMatrix.zeros(blockSize, numFeatures)

When implementing blockify+GMM, I found that without pre-allocating those matrices there is a serious regression (maybe 3~4 times slower; I forgot the exact numbers);
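To illustrate why the buffer shape matters here, the following is a minimal sketch of the reuse pattern in plain Scala, not the Spark implementation. `ScratchBuffer` and `forBlock` are hypothetical names, and a flat row-major `Array[Double]` stands in for the `DenseMatrix`. The buffer is sized once from `blockSize`, so it only works if no block has more rows than that. The sketch assumes that blocks built from a memory budget would not give such a fixed upper bound on rows.

```scala
// Hypothetical helper, not Spark code: a flat row-major scratch buffer of
// shape (blockSize x k), allocated once and reused for every block.
final class ScratchBuffer(blockSize: Int, k: Int) {
  // Allocated lazily on first use, like the lazy vals quoted above.
  private lazy val data: Array[Double] = new Array[Double](blockSize * k)

  /** Returns the shared buffer with the first numRows rows reset to zero. */
  def forBlock(numRows: Int): Array[Double] = {
    require(numRows <= blockSize, s"block has $numRows rows, buffer holds $blockSize")
    java.util.Arrays.fill(data, 0, numRows * k, 0.0)
    data
  }
}

object ScratchBufferDemo {
  def main(args: Array[String]): Unit = {
    val scratch = new ScratchBuffer(blockSize = 4, k = 3)
    val a = scratch.forBlock(4)
    a(0) = 1.0
    val b = scratch.forBlock(2) // a smaller trailing block reuses the same array
    println(a eq b)             // reference equality: no new allocation happened
    println(b(0))               // the used region was zeroed before reuse
  }
}
```

If the row count per block is not bounded in advance, a buffer like this has to be either oversized or reallocated, which is the cost this issue is concerned about.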

3, in MLP, three pre-allocated objects are also related to numRows:

if (ones == null || ones.length != delta.cols) ones = BDV.ones[Double](delta.cols)

// TODO: allocate outputs as one big array and then create BDMs from it
if (outputs == null || outputs(0).cols != currentBatchSize) {
...

// TODO: allocate deltas as one big array and then create BDMs from it
if (deltas == null || deltas(0).cols != currentBatchSize) {
  deltas = new Array[BDM[Double]](layerModels.length)
...

I am not very familiar with the implementation of MLP and could not find any documentation about this pre-allocation. But I guess there may be a regression if we disable this pre-allocation, since those objects look relatively large.
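The guard in the MLP snippets above reallocates only when the batch size changes. A rough sketch of that pattern, with an allocation counter added for illustration, is shown below. `BatchBufferCache`, `rowsPerColumn`, and `countAllocations` are hypothetical names and this is not the Spark code. The batch-size sequences are made-up inputs, chosen on the assumption that memory-based blocks can contain a different number of rows each time.

```scala
// Hypothetical illustration, not the Spark MLP code: cache a buffer and
// reallocate only when the requested batch size differs from the cached one.
final class BatchBufferCache(rowsPerColumn: Int) {
  private var buffer: Array[Double] = null
  private var cachedBatchSize: Int = -1
  var allocations: Int = 0 // counts how often a fresh array was created

  def get(currentBatchSize: Int): Array[Double] = {
    if (buffer == null || cachedBatchSize != currentBatchSize) {
      buffer = new Array[Double](rowsPerColumn * currentBatchSize)
      cachedBatchSize = currentBatchSize
      allocations += 1
    }
    buffer
  }
}

object BatchBufferCacheDemo {
  /** Number of allocations triggered by a sequence of batch sizes. */
  def countAllocations(batchSizes: Seq[Int]): Int = {
    val cache = new BatchBufferCache(rowsPerColumn = 8)
    batchSizes.foreach(size => cache.get(size))
    cache.allocations
  }

  def main(args: Array[String]): Unit = {
    // Row-count-based blocks: same size every time, except a smaller last block.
    println(countAllocations(Seq(128, 128, 128, 128, 40)))
    // Memory-based blocks (assumed): row counts differ from block to block.
    println(countAllocations(Seq(120, 131, 127, 133, 40)))
  }
}
```

With constant batch sizes the guard allocates twice (once for the full size, once for the trailing block). With sizes that change on every call it allocates every time, so the cache stops helping.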

People

Assignee: Unassigned

Reporter: Ruifeng Zheng

Votes: 0

Watchers: 1

Dates

Created: 23/Jun/20 01:47

Updated: 18/Mar/21 07:17

Resolved: 18/Mar/21 07:17