[MAHOUT-1214] Improve the accuracy of the Spectral KMeans Method - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.7
Fix Version/s: 0.8
Component/s: classic
Labels:
- clustering
- improvement
Environment:

Mahout 0.7

Description

The current implementation of the spectral KMeans algorithm (Andrew Ng. etc. NIPS 2002) in version 0.7 has two serious issues. These two incorrect implementations make it fail even for a very obvious trivial dataset. We have implemented a solution to resolve these two issues and hope to contribute back to the community.

Issue 1:
The EigenVerificationJob in version 0.7 does not check the orthogonality of eigenvectors, which is necessary to obtain the correct clustering results for the case of K>1; We have an idea and implementation to select based on cosAngle/orthogonality;

Issue 2:
The random seed initialization of KMeans algorithm is not optimal and sometimes a bad initialization will generate wrong clustering result. In this case, the selected K eigenvector actually provides a better way to initalize cluster centroids because each selected eigenvector is a relaxed indicator of the memberships of one cluster. For every selected eigenvector, we use the data point whose eigen component achieves the maximum absolute value.

We have already verified our improvement on synthetic dataset and it shows that the improved version get the optimal clustering result while the current 0.7 version obtains the wrong result.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

MAHOUT-1214.patch
25/Jun/13 06:16
39 kB
zhangda
MAHOUT-1214.patch
18/Jun/13 03:25
4 kB
Robin Anil
matrix_1
07/Jun/13 06:29
0.4 kB
Yiqun Hu
matrix_2
07/Jun/13 06:29
0.4 kB
Yiqun Hu

Activity

People

Assignee:: Robin Anil

Reporter:: Yiqun Hu

Votes:: 0 Vote for this issue

Watchers:: 11 Start watching this issue

Dates

Created:: 16/May/13 07:54

Updated:: 31/Jan/24 22:14

Resolved:: 27/Jun/13 22:32