Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-19208

MultivariateOnlineSummarizer performance optimization

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ML
    • Labels:
      None

      Description

      Now, MaxAbsScaler and MinMaxScaler are using MultivariateOnlineSummarizer to compute the min/max.
      However MultivariateOnlineSummarizer will also compute extra unused statistics. It slows down the task, moreover it is more prone to cause OOM.

      For example:
      env : --driver-memory 4G --executor-memory 1G --num-executors 4
      data: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra) 748401 instances, and 29,890,095 features
      MaxAbsScaler.fit fails because of OOM

      MultivariateOnlineSummarizer maintains 8 arrays:

      private var currMean: Array[Double] = _
        private var currM2n: Array[Double] = _
        private var currM2: Array[Double] = _
        private var currL1: Array[Double] = _
        private var totalCnt: Long = 0
        private var totalWeightSum: Double = 0.0
        private var weightSquareSum: Double = 0.0
        private var weightSum: Array[Double] = _
        private var nnz: Array[Long] = _
        private var currMax: Array[Double] = _
        private var currMin: Array[Double] = _
      

      For MaxAbsScaler, only 1 array is needed (max of abs value)
      For MinMaxScaler, only 3 arrays are needed (max, min, nnz)

      After modication in the pr, the above example run successfully.

        Attachments

        1. Tests.pdf
          3.56 MB
          zhengruifeng
        2. WechatIMG2621.jpeg
          565 kB
          zhengruifeng

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              podongfeng zhengruifeng
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: