[SPARK-6706] kmeans|| hangs for a long time if both k and vector dimension are large - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.2.1, 1.3.0
Fix Version/s: None
Component/s: MLlib
Labels:
- performance
Environment:

Windows 64bit, Linux 64bit

Description

When running k-means clustering with the "kmeans||" initialization algorithm, which is the default, the algorithm finishes some collect() jobs and then the driver hangs for a long time.

Settings:

k above 100
feature dimension about 360
total data size is about 100 MB

The issue was first noticed with Spark 1.2.1. I tested in both local and cluster mode. On Spark 1.3.0, I can also reproduce the issue in local mode. *However, I do not have a 1.3.0 cluster environment to test on.*
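
The settings above could be approximated with synthetic data along the following lines. This is only a sketch under stated assumptions, not the reporter's actual reproduction: the real data is in the attached archive and is not shown here, the helper names (`estimated_rows`, `synthetic_rows`, `train_kmeans_parallel`) are hypothetical, k is set to 101 as an arbitrary value "above 100", and the row count assumes dense 64-bit floats. The last function assumes a running `SparkContext` and PySpark's `pyspark.mllib.clustering.KMeans`, where the initialization mode is spelled "k-means||".

```python
import random

DIM = 360                          # "feature dimension about 360" from the report
K = 101                            # "k above 100"; 101 is an arbitrary choice
TARGET_BYTES = 100 * 1024 * 1024   # "about 100 MB"
BYTES_PER_VALUE = 8                # assumption: dense 64-bit floats


def estimated_rows(target_bytes=TARGET_BYTES, dim=DIM):
    """Number of dense rows that fit in target_bytes (integer division)."""
    return target_bytes // (dim * BYTES_PER_VALUE)


def synthetic_rows(n, dim=DIM, seed=42):
    """Yield n random vectors; a stand-in for the reporter's real data."""
    rng = random.Random(seed)
    for _ in range(n):
        yield [rng.random() for _ in range(dim)]


def train_kmeans_parallel(sc, rows, k=K):
    """Train MLlib k-means with the k-means|| initializer.

    Requires pyspark and an existing SparkContext `sc`; the import is
    done lazily so the helpers above work without Spark installed.
    """
    from pyspark.mllib.clustering import KMeans
    data = sc.parallelize(list(rows)).cache()
    return KMeans.train(data, k, maxIterations=20,
                        initializationMode="k-means||")
```

For comparison, the same call with `initializationMode="random"` skips the k-means|| initialization step, which may help confirm whether the long pause happens during initialization.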

Attachments

- kmeans-debug.7z (5 kB), attached by Xi Shen on 04/Apr/15 01:42

Issue Links

duplicates

SPARK-3220 K-Means clusterer should perform K-Means initialization in parallel

Resolved

links to

[Github] Pull Request #13133 (mouendless)

People

Assignee: Xiangrui Meng

Reporter: Xi Shen

Shepherd: Xiangrui Meng

Votes: 0

Watchers: 3

Dates

Created: 04/Apr/15 01:35

Updated: 16/May/16 10:15

Resolved: 05/Apr/15 09:26