[SPARK-4581] Refactorize StandardScaler to improve the transformation performance - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.2.0
Component/s: MLlib
Labels:
None

Target Version/s:

1.2.0

Description

The following optimizations are done to improve the StandardScaler model transformation performance.

1) Covert Breeze dense vector to primitive vector to reduce the overhead.
2) Since mean can be potentially a sparse vector, we explicitly convert it to dense primitive vector.
3) Have a local reference to `shift` and `factor` array so JVM can locate the value with one operation call.
4) In pattern matching part, we use the mllib SparseVector/DenseVector instead of breeze's vector to make the codebase cleaner.

Benchmark with mnist8m dataset:

Before,
DenseVector withMean and withStd: 50.97secs
DenseVector withMean and withoutStd: 42.11secs
DenseVector withoutMean and withStd: 8.75secs
SparseVector withoutMean and withStd: 5.437

With this PR,
DenseVector withMean and withStd: 5.76secs
DenseVector withMean and withoutStd: 5.28secs
DenseVector withoutMean and withStd: 5.30secs
SparseVector withoutMean and withStd: 1.27

Attachments

Issue Links

links to

[Github] Pull Request #3435 (dbtsai)

Activity

People

Assignee:: DB Tsai

Reporter:: DB Tsai

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 24/Nov/14 23:58

Updated:: 25/Nov/14 19:07

Resolved:: 25/Nov/14 19:07