[SPARK-5405] Spark clusterer should support high dimensional data - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.2.0
Fix Version/s: None
Component/s: MLlib
Labels:
- clustering

Description

The MLlib clusterer works well for low-dimensional data (fewer than 200 dimensions). However, its running time grows linearly with the number of dimensions, so for practical purposes it is not very useful for high-dimensional data.

Depending on the data type, one can embed the high dimensional data into lower dimensional spaces in a distance-preserving way. The Spark clusterer should support such embedding.

An example implementation that supports high dimensional data is here:
https://github.com/derrickburns/generalized-kmeans-clustering
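
To illustrate what a distance-preserving embedding can look like, here is a minimal sketch of a Gaussian random projection, which approximately preserves Euclidean distances. This technique is chosen purely for illustration; it is an assumption, not necessarily the method used by the linked repository or by MLlib. The function names and the dimensions (1000 reduced to 200) are hypothetical.

```python
import math
import random


def random_projection_matrix(in_dim, out_dim, seed=0):
    """Build an out_dim x in_dim matrix of N(0, 1/out_dim) entries.

    Scaling by 1/sqrt(out_dim) keeps the expected squared length of a
    projected vector equal to the squared length of the original.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(matrix, point):
    """Map a high-dimensional point to the lower-dimensional space."""
    return [sum(w * x for w, x in zip(row, point)) for row in matrix]


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_ratio(in_dim=1000, out_dim=200, seed=1):
    """Compare the distance between two random points before and after
    projection. A ratio near 1.0 means the distance was roughly preserved."""
    rng = random.Random(seed)
    a = [rng.random() for _ in range(in_dim)]
    b = [rng.random() for _ in range(in_dim)]
    matrix = random_projection_matrix(in_dim, out_dim, seed=seed + 1)
    return euclidean(project(matrix, a), project(matrix, b)) / euclidean(a, b)
```

A clusterer could then run on the projected points, so that the cost of each distance computation depends on the reduced dimension instead of the original one. Whether the resulting approximation is acceptable depends on the data and on the distance function in use.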

People

Assignee: Unassigned
Reporter: Derrick Burns
Votes: 0
Watchers: 3

Dates

Created: 26/Jan/15 06:29
Updated: 25/Feb/15 08:52
Resolved: 25/Feb/15 08:52

Time Tracking

Estimated: 504h
Remaining: 504h
Logged: Not Specified