Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-19208

MultivariateOnlineSummarizer performance optimization

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Not A Problem
    • None
    • None
    • ML
    • None

    Description

      Now, MaxAbsScaler and MinMaxScaler are using MultivariateOnlineSummarizer to compute the min/max.
      However MultivariateOnlineSummarizer will also compute extra unused statistics. It slows down the task, moreover it is more prone to cause OOM.

      For example:
      env : --driver-memory 4G --executor-memory 1G --num-executors 4
      data: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra) 748401 instances, and 29,890,095 features
      MaxAbsScaler.fit fails because of OOM

      MultivariateOnlineSummarizer maintains 8 arrays:

      private var currMean: Array[Double] = _
        private var currM2n: Array[Double] = _
        private var currM2: Array[Double] = _
        private var currL1: Array[Double] = _
        private var totalCnt: Long = 0
        private var totalWeightSum: Double = 0.0
        private var weightSquareSum: Double = 0.0
        private var weightSum: Array[Double] = _
        private var nnz: Array[Long] = _
        private var currMax: Array[Double] = _
        private var currMin: Array[Double] = _
      

      For MaxAbsScaler, only 1 array is needed (max of abs value)
      For MinMaxScaler, only 3 arrays are needed (max, min, nnz)

      After modication in the pr, the above example run successfully.

      Attachments

        1. WechatIMG2621.jpeg
          565 kB
          Ruifeng Zheng
        2. Tests.pdf
          3.56 MB
          Ruifeng Zheng

        Activity

          People

            Unassigned Unassigned
            podongfeng Ruifeng Zheng
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: