Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Not A Problem
-
None
-
None
-
None
Description
Now, MaxAbsScaler and MinMaxScaler are using MultivariateOnlineSummarizer to compute the min/max.
However MultivariateOnlineSummarizer will also compute extra unused statistics. It slows down the task, moreover it is more prone to cause OOM.
For example:
env : --driver-memory 4G --executor-memory 1G --num-executors 4
data: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra) 748401 instances, and 29,890,095 features
MaxAbsScaler.fit fails because of OOM
MultivariateOnlineSummarizer maintains 8 arrays:
private var currMean: Array[Double] = _ private var currM2n: Array[Double] = _ private var currM2: Array[Double] = _ private var currL1: Array[Double] = _ private var totalCnt: Long = 0 private var totalWeightSum: Double = 0.0 private var weightSquareSum: Double = 0.0 private var weightSum: Array[Double] = _ private var nnz: Array[Long] = _ private var currMax: Array[Double] = _ private var currMin: Array[Double] = _
For MaxAbsScaler, only 1 array is needed (max of abs value)
For MinMaxScaler, only 3 arrays are needed (max, min, nnz)
After modication in the pr, the above example run successfully.