[SPARK-4708] Make k-means run two/three times faster with dense/sparse samples - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Implemented
Affects Version/s: None
Fix Version/s: 1.2.0
Component/s: MLlib
Labels: None
Target Version/s: 1.2.0

Description

The call to `breezeSquaredDistance` in `org.apache.spark.mllib.util.MLUtils.fastSquaredDistance` is on the critical path of k-means, and `breezeSquaredDistance` is slow. We should replace it with our own implementation.
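
As a rough illustration of what a hand-written replacement could look like, the sketch below computes the squared Euclidean distance directly with while loops over primitive arrays, for both dense and sparse inputs. It is not the code from the pull request: the object name `SquaredDistanceSketch`, the method names `sqdistDense` and `sqdistSparse`, and the plain-array representation of vectors are all assumptions made to keep the example self-contained.

```scala
// Hypothetical standalone sketch; not the implementation from the pull request.
object SquaredDistanceSketch {

  /** Squared Euclidean distance between two dense vectors of equal length. */
  def sqdistDense(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    var sum = 0.0
    var i = 0
    // A plain while loop avoids boxing and iterator overhead.
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  /**
   * Squared Euclidean distance between two sparse vectors, each given as
   * strictly increasing indices plus the matching non-zero values.
   */
  def sqdistSparse(
      aIdx: Array[Int], aVal: Array[Double],
      bIdx: Array[Int], bVal: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    var j = 0
    // Merge the two sorted index lists, visiting each active index once.
    while (i < aIdx.length || j < bIdx.length) {
      val d =
        if (j >= bIdx.length || (i < aIdx.length && aIdx(i) < bIdx(j))) {
          // Index present only in a: b is implicitly zero here.
          val v = aVal(i); i += 1; v
        } else if (i >= aIdx.length || bIdx(j) < aIdx(i)) {
          // Index present only in b: a is implicitly zero here.
          val v = bVal(j); j += 1; v
        } else {
          // Index present in both vectors.
          val v = aVal(i) - bVal(j); i += 1; j += 1; v
        }
      sum += d * d
    }
    sum
  }
}
```

For example, `SquaredDistanceSketch.sqdistDense(Array(1.0, 0.0, 2.0), Array(0.0, 0.0, 4.0))` and the sparse equivalent `SquaredDistanceSketch.sqdistSparse(Array(0, 2), Array(1.0, 2.0), Array(2), Array(4.0))` both evaluate to 5.0. The sparse version only touches stored entries, so its cost scales with the number of non-zeros rather than the vector dimension.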

Here is the benchmark against the mnist8m dataset.

Before
DenseVector: 70.04 secs
SparseVector: 59.05 secs

With this PR
DenseVector: 30.58 secs
SparseVector: 21.14 secs

Issue Links

links to

[Github] Pull Request #3565 (dbtsai)

People

Assignee: DB Tsai
Reporter: DB Tsai
Votes: 0
Watchers: 2

Dates

Created: 03/Dec/14 01:20
Updated: 03/Dec/14 11:03
Resolved: 03/Dec/14 11:03