Mahout / MAHOUT-279

Make RandomSeedGenerator an M/R Job

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Later
    • Affects Version/s: 0.3
    • Fix Version/s: None
    • Component/s: Clustering
    • Labels: None

      Description

      Speed up random centroid selection for clustering by using Map/Reduce.
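      One common way to pick k random records in a map/reduce style is to tag every record with a random key, keep the k smallest keys per input split, and merge the per-split candidates in a single reduce step. The sketch below simulates that locally in plain Python. It is an illustration only, not Mahout's RandomSeedGenerator or any patch attached to this issue, and the names map_sample, reduce_sample and sample_centroids are hypothetical.

```python
import heapq
import random


def map_sample(records, k, rng):
    """Mapper side: tag each record with a random key, keep the k smallest keys."""
    tagged = ((rng.random(), rec) for rec in records)
    return heapq.nsmallest(k, tagged, key=lambda pair: pair[0])


def reduce_sample(partials, k):
    """Reducer side: merge per-split candidates, keep the k smallest keys overall."""
    merged = [pair for part in partials for pair in part]
    winners = heapq.nsmallest(k, merged, key=lambda pair: pair[0])
    return [rec for _, rec in winners]


def sample_centroids(splits, k, seed=0):
    """Simulate the job locally: one map_sample call per split, then one reduce."""
    rng = random.Random(seed)
    partials = [map_sample(split, k, rng) for split in splits]
    return reduce_sample(partials, k)
```

      Because every record gets an independent random key, the k records with the smallest keys form a uniform sample without replacement, and each mapper only ever forwards k candidates to the reducer.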

      The scope of this issue is being widened:

      • The random seed generator could take a distance measure and a threshold, and use them during random eviction and insertion to increase the distance between any two selected centroids.
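
      A minimal sketch of how such a threshold could be applied during insertion and eviction, assuming the generator keeps a fixed-size reservoir of k seeds. The reading of "threshold" as a minimum pairwise distance is an assumption, and pick_seeds, far_enough and euclidean are hypothetical names, not Mahout APIs.

```python
import math
import random


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def far_enough(candidate, centroids, distance, threshold):
    """True if candidate is at least `threshold` away from every centroid."""
    return all(distance(candidate, c) >= threshold for c in centroids)


def pick_seeds(vectors, k, distance=euclidean, threshold=0.0, seed=0):
    """Reservoir-style selection that rejects candidates closer than threshold."""
    rng = random.Random(seed)
    seeds = []
    for n, v in enumerate(vectors, start=1):
        if len(seeds) < k:
            # Insertion: accept only if v keeps all seeds pairwise far apart.
            if far_enough(v, seeds, distance, threshold):
                seeds.append(v)
            continue
        # Eviction: the usual reservoir step considers v with probability k/n.
        if rng.random() < k / n:
            slot = rng.randrange(k)
            others = seeds[:slot] + seeds[slot + 1:]
            # Replace the evicted seed only if v is far from the remaining ones.
            if far_enough(v, others, distance, threshold):
                seeds[slot] = v
    return seeds
```

      With threshold=0.0 this reduces to ordinary reservoir sampling. With a positive threshold the result is no longer a uniform sample, and fewer than k seeds may be returned if the data cannot supply k points that far apart.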

        Activity

        Sean Owen added a comment -

        Here's a different suggestion. The problem is efficiently picking a couple of vectors out of billions. An M/R job seems like overkill for that.

        This patch just picks random points in the file, syncs, and reads. Unless the underlying implementation is awful, this should be very fast. The downside is that the choice is slightly biased. We could fix that if needed.

        I don't know if this works. Is there a way to test reading on real input?
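
        The seek-then-sync idea can be sketched on an ordinary local file. The example below uses a newline as a stand-in for a sync marker and is not the attached patch or Hadoop's SequenceFile API. The function name sample_by_seek is hypothetical.

```python
import os
import random
import tempfile


def sample_by_seek(path, k, seed=0):
    """Pick k records by jumping to random byte offsets and reading the next record."""
    rng = random.Random(seed)
    size = os.path.getsize(path)
    if size == 0:
        return []
    picks = []
    with open(path, "rb") as f:
        while len(picks) < k:
            f.seek(rng.randrange(size))
            f.readline()          # discard the partial record ("sync" to a boundary)
            line = f.readline()   # first whole record after the boundary
            if not line:          # landed in the last record: wrap to the first one
                f.seek(0)
                line = f.readline()
            picks.append(line.rstrip(b"\n").decode("utf-8"))
    return picks


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("\n".join(f"vec-{i}" for i in range(1000)) + "\n")
    print(sample_by_seek(tmp.name, 3))
    os.remove(tmp.name)
```

        The bias is visible in this sketch: a record is chosen with probability proportional to the byte length of the record before it, so records that follow long records are picked more often. The sketch also samples with replacement, so the same record can appear twice.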

        Sean Owen added a comment -

        Bah, it doesn't actually work in Hadoop, for reasons I don't quite understand. Never mind.

        Ted Dunning added a comment -

        Is this overlapping with the k-means++ stuff?

        Sean Owen added a comment -

        Am I right that this has stalled, and is at least not going into 0.4?

        Ted Dunning added a comment -

        Seems right to me (not for 0.4, that is).

        Jeff Eastman added a comment -

        Moving this from limbo to 0.5

        Sean Owen added a comment -

        What's the thinking here? Is there a good use case for it, so that the patch should be cleaned up? Or is this no longer interesting?

        Sean Owen added a comment -

        Am I right that this one is dead?


          People

          • Assignee: Robin Anil
          • Reporter: Robin Anil
          • Votes: 0
          • Watchers: 0
