Spark / SPARK-5405

Spark clusterer should support high dimensional data


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.2.0
    • Fix Version/s: None
    • Component/s: MLlib

    Description

      The MLlib clusterer works well for low-dimensional data (fewer than 200 dimensions). However, its running time grows linearly with the number of dimensions, so in practice it is not very useful for high-dimensional data.

      Depending on the data type, high-dimensional data can be embedded into a lower-dimensional space in a distance-preserving way. The Spark clusterer should support such embeddings.

      An example implementation that supports high dimensional data is here:
      https://github.com/derrickburns/generalized-kmeans-clustering
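
      To make the idea of a distance-preserving embedding concrete, here is a minimal sketch of one well-known technique, Gaussian random projection. It is an illustration only: the issue does not say which embedding is intended, and this is not necessarily what the linked implementation does. The names make_projection, embed, and euclidean are hypothetical helpers, not part of MLlib.

```python
import math
import random


def make_projection(in_dim, out_dim, seed=0):
    """Build an out_dim x in_dim matrix of Gaussian entries.

    Entries are drawn from N(0, 1) and scaled by 1/sqrt(out_dim) so that
    squared Euclidean distances are preserved in expectation.
    (Hypothetical helper for illustration; not an MLlib API.)
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def embed(matrix, point):
    """Map one high-dimensional point to the lower-dimensional space."""
    return [sum(w * x for w, x in zip(row, point)) for row in matrix]


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

      With such a projection, each point would be embedded once up front, and the clusterer would then run on the shorter vectors, so the per-distance cost depends on the embedded dimension rather than the original one. The distances are preserved only approximately, and the quality of the approximation improves as the embedded dimension grows.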


          People

            Assignee: Unassigned
            Reporter: Derrick Burns (derrickburns)
            Votes: 0
            Watchers: 3

              Time Tracking

                Original Estimate: 504h
                Remaining Estimate: 504h
                Time Spent: Not Specified