Spark / SPARK-25219

KMeans Clustering - Text Data - Results are incorrect

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: ML
    • Important

    Description

      Hello Everyone,

      I am facing issues using KMeans clustering on my text data. When I cluster the data after applying transformations such as RegexTokenizer, stop-word removal, HashingTF, and IDF, the generated clusters are not meaningful: one cluster ends up with a disproportionately large number of data points assigned to it.

      I am able to cluster the same data successfully, with similar processing and the same attributes, using scikit-learn's KMeans algorithm.
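For readers following the report, the feature steps named above can be sketched in plain Python. This is an illustrative stand-in, not the reporter's code and not Spark's implementation: the function names, the tiny stop-word list, the `NUM_FEATURES` value, and the CRC32 hash (Spark's HashingTF uses a different hash function) are all assumptions. The IDF weight follows the smoothed form documented for Spark, log((N + 1) / (df + 1)). The optional `l2_normalize` step is included only as something worth checking when comparing the two libraries: scikit-learn's `TfidfVectorizer` L2-normalizes each row by default, while Spark's IDF output is not normalized. Whether that explains the behaviour reported here is not established by this issue.

```python
import math
import re
import zlib

# Hypothetical values chosen for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in"}
NUM_FEATURES = 32


def tokenize(text):
    """Lowercase the text and split it on non-word characters."""
    return re.findall(r"\w+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Map tokens to a fixed-size term-frequency vector via the hashing trick."""
    vec = [0.0] * num_features
    for tok in tokens:
        # CRC32 is a stand-in hash; collisions are possible by design.
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        vec[idx] += 1.0
    return vec


def idf_weights(tf_vectors):
    """Smoothed inverse document frequency: log((N + 1) / (df + 1))."""
    n_docs = len(tf_vectors)
    dims = len(tf_vectors[0])
    weights = []
    for j in range(dims):
        df = sum(1 for v in tf_vectors if v[j] > 0)
        weights.append(math.log((n_docs + 1) / (df + 1)))
    return weights


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length; zero vectors are returned as-is."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def featurize(docs, normalize=False):
    """Run tokenize -> stop-word removal -> hashing TF -> IDF (-> optional L2)."""
    tfs = [hashing_tf(remove_stop_words(tokenize(d))) for d in docs]
    weights = idf_weights(tfs)
    feats = [[tf * w for tf, w in zip(v, weights)] for v in tfs]
    if normalize:
        feats = [l2_normalize(v) for v in feats]
    return feats
```

Calling `featurize(docs)` and `featurize(docs, normalize=True)` on the same documents gives two feature sets that differ only in row scaling, which makes it easy to see whether document length alone is driving Euclidean distances before any clustering is run.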

      Searching online, I see that many others have reported the same issue with Spark's KMeans clustering library.

      I request your help in fixing this issue.

      Please let me know if you require any additional details.

      Attachments

        1. SKLearn_Kmeans.txt
          1 kB
          Vasanthkumar Velayudham
        2. Apache_Logs_Results.xlsx
          20 kB
          Vasanthkumar Velayudham
        3. Spark_Kmeans.txt
          2 kB
          Vasanthkumar Velayudham

          People

            Assignee: Unassigned
            Reporter: Vasanthkumar Velayudham (VVasanth)
            Votes: 0
            Watchers: 2
