SPARK-4581: Refactor StandardScaler to improve the transformation performance

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.0
    • Component/s: MLlib
    • Labels: None

    Description

      The following optimizations were made to improve the performance of the StandardScaler model transformation.

      1) Convert the Breeze dense vector to a primitive array to reduce overhead.
      2) Since the mean can potentially be a sparse vector, explicitly convert it to a dense primitive array.
      3) Keep local references to the `shift` and `factor` arrays so the JVM can locate each value with a single operation instead of going through the enclosing object on every access.
      4) In the pattern-matching part, match on MLlib's SparseVector/DenseVector instead of Breeze's vector types to make the codebase cleaner.
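
      The points above can be illustrated with a minimal, dependency-free sketch. This is not the code from the PR: it is plain Java rather than Scala, `ScalerSketch` and its method names are hypothetical, and treating a zero standard deviation as a factor of 0.0 is an assumption made for the example. It only shows the shape of the idea: dense primitive arrays for `shift` and `factor`, local references taken once before the loop, and separate paths for dense and sparse input.

```java
/** Hypothetical stand-in for a fitted scaler; NOT the MLlib StandardScalerModel API. */
final class ScalerSketch {
    // Points 1 and 2: both arrays are dense primitive arrays, never wrapped vectors.
    private final double[] shift;   // per-feature mean
    private final double[] factor;  // per-feature 1/std (assumed 0.0 where std is 0)

    ScalerSketch(double[] mean, double[] std) {
        this.shift = mean.clone();
        this.factor = new double[std.length];
        for (int i = 0; i < std.length; i++) {
            this.factor[i] = std[i] != 0.0 ? 1.0 / std[i] : 0.0;
        }
    }

    /** Dense path: center and scale every feature. */
    double[] transformDense(double[] values) {
        // Point 3: take local references once, outside the hot loop.
        final double[] localShift = shift;
        final double[] localFactor = factor;
        final double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (values[i] - localShift[i]) * localFactor[i];
        }
        return out;
    }

    /** Sparse path: scale only the stored values; no centering, so sparsity is kept. */
    double[] transformSparseValues(int[] indices, double[] values) {
        final double[] localFactor = factor;
        final double[] out = new double[values.length];
        for (int k = 0; k < values.length; k++) {
            out[k] = values[k] * localFactor[indices[k]];
        }
        return out;
    }
}
```

      The sparse path corresponds to the "withoutMean and withStd" case in the benchmark: subtracting a mean would turn every zero entry into a non-zero one, so only scaling is applied to sparse input in this sketch.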

      Benchmark with mnist8m dataset:

      Before,
      DenseVector withMean and withStd: 50.97secs
      DenseVector withMean and withoutStd: 42.11secs
      DenseVector withoutMean and withStd: 8.75secs
      SparseVector withoutMean and withStd: 5.437secs

      With this PR,
      DenseVector withMean and withStd: 5.76secs
      DenseVector withMean and withoutStd: 5.28secs
      DenseVector withoutMean and withStd: 5.30secs
      SparseVector withoutMean and withStd: 1.27secs

          People

            dbtsai DB Tsai