[SPARK-25219] KMeans Clustering - Text Data - Results are incorrect - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.3.0
Fix Version/s: None
Component/s: ML
Labels:
- bulk-closed

Flags: Important

Description

Hello Everyone,

I am facing issues with KMeans clustering on my text data. When I cluster the data after applying a series of transformations (RegexTokenizer, stop-word removal, HashingTF, IDF), the generated clusters are not meaningful, and one cluster ends up with a large share of the data points assigned to it.
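To make the transformation steps named above concrete, here is a minimal pure-Python sketch of tokenizing, stop-word filtering, feature hashing, and IDF weighting. It is not Spark's implementation and is not taken from the attached scripts: the stop-word list, the `crc32` hash, the bucket count, and all function names are assumptions chosen for illustration. The IDF uses the smoothed form log((m + 1) / (df + 1)), which is assumed here to match what the Spark feature documentation describes.

```python
import math
import re
import zlib
from collections import Counter

# Hypothetical stop-word list; a real stop-word remover ships a much larger one.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in"}


def tokenize(text):
    # Lowercase, then split on runs of non-word characters (regex tokenizer step).
    return [t for t in re.split(r"\W+", text.lower()) if t]


def remove_stop_words(tokens):
    # Drop tokens that appear in the stop-word list.
    return [t for t in tokens if t not in STOP_WORDS]


def hashing_tf(tokens, num_features):
    # Map each token to a bucket with a stable hash and count occurrences.
    # crc32 is a stand-in; Spark's HashingTF uses a different hash function.
    counts = Counter(zlib.crc32(t.encode("utf-8")) % num_features for t in tokens)
    return dict(counts)


def idf_weights(tf_vectors):
    # Smoothed IDF per bucket: log((m + 1) / (df + 1)), m = number of documents.
    m = len(tf_vectors)
    df = Counter(idx for vec in tf_vectors for idx in vec)
    return {idx: math.log((m + 1) / (d + 1)) for idx, d in df.items()}


def tf_idf(docs, num_features=1024):
    # Full chain: tokenize -> remove stop words -> hash to counts -> scale by IDF.
    tf_vectors = [
        hashing_tf(remove_stop_words(tokenize(d)), num_features) for d in docs
    ]
    idf = idf_weights(tf_vectors)
    return [{idx: tf * idf[idx] for idx, tf in vec.items()} for vec in tf_vectors]
```

Note that in this sketch a term occurring in every document receives an IDF of zero, and the resulting vectors are not length-normalized, so longer documents produce vectors with larger norms. Whether either property matters for the clustering result reported here is not established by the issue.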

I am able to cluster the same data with scikit-learn's KMeans, using similar preprocessing and the same attributes.
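When comparing the two runs, it can help to quantify how unevenly the points are spread across clusters. The following is a small sketch, assuming the predicted cluster labels from each library have already been collected into plain Python lists; the function names are hypothetical.

```python
from collections import Counter


def cluster_size_shares(labels):
    """Return {cluster_id: share of points}, ordered from largest to smallest."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.most_common()}


def largest_cluster_share(labels):
    """Share of points in the biggest cluster (0.0 for an empty input)."""
    shares = cluster_size_shares(labels)
    return max(shares.values()) if shares else 0.0


if __name__ == "__main__":
    # Illustrative labels only; these are NOT taken from the attached results.
    skewed = [0] * 8 + [1, 2]
    balanced = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
    print(largest_cluster_share(skewed), largest_cluster_share(balanced))
```

A share close to 1/k suggests balanced clusters, while a share close to 1.0 matches the symptom described above, where one cluster absorbs most points.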

Searching online, I see that many others have reported the same issue with Spark's KMeans clustering library.

I request your help in fixing this issue.

Please let me know if you require any additional details.

Attachments

- SKLearn_Kmeans.txt (1 kB), uploaded 28/Aug/18 17:48 by Vasanthkumar Velayudham
- Apache_Logs_Results.xlsx (20 kB), uploaded 28/Aug/18 17:48 by Vasanthkumar Velayudham
- Spark_Kmeans.txt (2 kB), uploaded 28/Aug/18 17:48 by Vasanthkumar Velayudham

People

Assignee: Unassigned

Reporter: Vasanthkumar Velayudham

Votes: 0

Watchers: 2

Dates

Created: 23/Aug/18 23:39

Updated: 08/Oct/19 05:42

Resolved: 08/Oct/19 05:42