Spark / SPARK-4581

Refactor StandardScaler to improve the transformation performance

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.0
    • Component/s: MLlib
    • Labels:
      None
    • Target Version/s:

      Description

      The following optimizations improve the transformation performance of the StandardScaler model:

      1) Convert the Breeze dense vector to a primitive array to reduce overhead.
      2) Since the mean can potentially be a sparse vector, explicitly convert it to a dense primitive array.
      3) Keep local references to the `shift` and `factor` arrays so the JVM can locate each value with a single operation.
      4) In the pattern-matching code, use MLlib's SparseVector/DenseVector instead of Breeze's vectors to make the codebase cleaner.
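
      The four points above can be sketched roughly as follows. This is a minimal illustration, not the code from the actual patch: `Vec`, `DenseVec`, `SparseVec`, and `ScalerSketch` are hypothetical stand-ins for the MLlib types so that the sketch runs without Spark, and the handling of zero standard deviation and of sparse input with centering are assumptions.

```scala
// Hypothetical stand-ins for MLlib's DenseVector/SparseVector, so the sketch runs without Spark.
sealed trait Vec { def size: Int }
final case class DenseVec(values: Array[Double]) extends Vec {
  def size: Int = values.length
}
final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) extends Vec

// Hypothetical scaler; `mean` and `std` are already dense primitive arrays (points 1 and 2).
final class ScalerSketch(
    mean: Array[Double],
    std: Array[Double],
    withMean: Boolean,
    withStd: Boolean) {

  private val shift: Array[Double] = mean
  // Assumption: features with zero standard deviation are scaled to 0.0.
  private val factor: Array[Double] = std.map(s => if (s != 0.0) 1.0 / s else 0.0)

  def transform(vector: Vec): Vec = {
    require(vector.size == shift.length, "dimension mismatch")
    if (withMean) {
      // Local references avoid a field lookup on every loop iteration (point 3).
      val localShift = shift
      val localFactor = factor
      // Match on the vector types directly (point 4).
      vector match {
        case DenseVec(vs) =>
          val out = vs.clone()
          var i = 0
          if (withStd) {
            while (i < out.length) { out(i) = (out(i) - localShift(i)) * localFactor(i); i += 1 }
          } else {
            while (i < out.length) { out(i) -= localShift(i); i += 1 }
          }
          DenseVec(out)
        case _: SparseVec =>
          // Assumption: centering is rejected for sparse input because it would densify it.
          throw new IllegalArgumentException("Centering would densify a sparse vector.")
      }
    } else if (withStd) {
      val localFactor = factor
      vector match {
        case DenseVec(vs) =>
          val out = vs.clone()
          var i = 0
          while (i < out.length) { out(i) *= localFactor(i); i += 1 }
          DenseVec(out)
        case SparseVec(n, idx, vs) =>
          // Only the stored non-zero entries need to be scaled.
          val out = vs.clone()
          var i = 0
          while (i < out.length) { out(i) *= localFactor(idx(i)); i += 1 }
          SparseVec(n, idx, out)
      }
    } else {
      vector
    }
  }
}
```

      The `while` loops over primitive arrays are a deliberate choice in this sketch: they avoid the boxing and iterator overhead that higher-level collection operations can introduce on a hot path.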

      Benchmark with mnist8m dataset:

      Before:
      DenseVector withMean and withStd: 50.97 secs
      DenseVector withMean and withoutStd: 42.11 secs
      DenseVector withoutMean and withStd: 8.75 secs
      SparseVector withoutMean and withStd: 5.437 secs

      With this PR:
      DenseVector withMean and withStd: 5.76 secs
      DenseVector withMean and withoutStd: 5.28 secs
      DenseVector withoutMean and withStd: 5.30 secs
      SparseVector withoutMean and withStd: 1.27 secs
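
      To read the numbers above as speedup factors, a small sketch like the following divides each "before" time by the matching "after" time. The object name `BenchmarkSpeedup` is hypothetical, and the SparseVector timings are assumed to be in seconds like the others.

```scala
object BenchmarkSpeedup {
  // (configuration, seconds before, seconds with this PR), copied from the benchmark above.
  // Assumption: the SparseVector figures are in seconds like the rest.
  val timings: Seq[(String, Double, Double)] = Seq(
    ("DenseVector withMean and withStd", 50.97, 5.76),
    ("DenseVector withMean and withoutStd", 42.11, 5.28),
    ("DenseVector withoutMean and withStd", 8.75, 5.30),
    ("SparseVector withoutMean and withStd", 5.437, 1.27)
  )

  // Speedup factor = time before / time after.
  val speedups: Seq[(String, Double)] =
    timings.map { case (name, before, after) => (name, before / after) }

  def main(args: Array[String]): Unit =
    speedups.foreach { case (name, factor) => println(f"$name%-40s $factor%.2fx") }
}
```

      By this calculation the dense cases with centering improve by roughly 8x, the sparse case by about 4x, and the dense case without centering by about 1.7x.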

            People

            • Assignee: DB Tsai (dbtsai)
            • Reporter: DB Tsai (dbtsai)