Mahout / MAHOUT-1539

Implement affinity matrix computation in Mahout DSL

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.9
    • Fix Version/s: 0.11.0
    • Component/s: Clustering
    • Labels:

      Description

      This has the same goal as MAHOUT-1506, but rather than code the pairwise computations in MapReduce, this will be done in the Mahout DSL.

      An orthogonal issue is the format of the raw input (vectors, text, images, SequenceFiles), and how the user specifies the distance equation and any associated parameters.
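To make the goal concrete, here is a minimal sketch of a pairwise affinity computation. It is plain Python rather than the Mahout DSL, and it is not taken from this issue or from MAHOUT-1506. The Gaussian (RBF) kernel, the `sigma` parameter, and the names `euclidean` and `affinity_matrix` are illustrative assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def affinity_matrix(points, sigma=1.0):
    """Build a dense, symmetric n x n affinity matrix.

    Assumption: affinity is a Gaussian kernel of the Euclidean distance,
    A[i][j] = exp(-d(i, j)^2 / (2 * sigma^2)). Other kernels are possible.
    """
    n = len(points)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 1.0  # a point is maximally similar to itself
        for j in range(i + 1, n):
            d = euclidean(points[i], points[j])
            a = math.exp(-(d * d) / (2.0 * sigma * sigma))
            A[i][j] = a  # fill both halves: the matrix is symmetric
            A[j][i] = a
    return A


# Hypothetical input: three 2-D points, two close together and one far away.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
A = affinity_matrix(points, sigma=1.0)
```

The double loop visits each unordered pair once, which is the O(n^2) pairwise work that a distributed implementation would need to partition.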

        Activity

        smarthi Suneel Marthi added a comment -

        Closing issue following 0.11.0 Release

        smarthi Suneel Marthi added a comment -

        Issue has been open for over a year with no progress; please feel free to create a new Jira when you find time to work on this.

        kanjilal Saikat Kanjilal added a comment - - edited

        Enough with high-level concepts already, so I took the next logical step:

        I'm not ready to include my code in the Mahout master repo yet, so I created my own repo and started a sample implementation there. You will see a first cut of LocalitySensitiveHashing implemented using Euclidean distance only; the code at least compiles as a first step:

        https://github.com/skanjila/AffinityMatrix

        TBD:
        1) Implement unit tests, and potentially integration tests, to test the performance of this
        2) Once LSH is fully tested, implement the AffinityMatrix piece on top of it
        3) Add some more unit tests for AffinityMatrix
        4) Add CosineDistance and ManhattanDistance as configurable parameters
        5) Integrate with the Spark API, specifically invoking the SparkContext and using the broadcast mechanisms in Spark clusters as appropriate
        6) Merge this into my checked-out Mahout branch

        Some early feedback on the code would be greatly appreciated. Watch for frequent changes in my repo.
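For readers unfamiliar with the technique, the following is a minimal sketch of locality-sensitive hashing for Euclidean distance. It is not the code in the linked repo. It assumes the common random-projection scheme, in which each hash is floor((a . v + b) / w) with a Gaussian vector a and a uniform offset b. The class name `EuclideanLSH` and the parameters `num_hashes`, `bucket_width`, and `seed` are hypothetical.

```python
import math
import random
from collections import defaultdict


class EuclideanLSH:
    """Single-table LSH index for Euclidean distance (illustrative sketch)."""

    def __init__(self, dim, num_hashes=4, bucket_width=4.0, seed=42):
        rng = random.Random(seed)  # fixed seed keeps signatures reproducible
        self.w = bucket_width
        # One Gaussian projection vector and one uniform offset per hash.
        self.projections = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hashes)
        ]
        self.offsets = [rng.uniform(0.0, bucket_width) for _ in range(num_hashes)]
        self.buckets = defaultdict(list)

    def signature(self, v):
        """Tuple of bucket indices, one per hash function."""
        return tuple(
            math.floor((sum(a_i * v_i for a_i, v_i in zip(a, v)) + b) / self.w)
            for a, b in zip(self.projections, self.offsets)
        )

    def add(self, key, v):
        """Store key in the bucket named by the signature of v."""
        self.buckets[self.signature(v)].append(key)

    def candidates(self, v):
        """Keys sharing v's full signature; likely, not guaranteed, neighbours."""
        return list(self.buckets.get(self.signature(v), []))


lsh = EuclideanLSH(dim=2)
lsh.add("p0", (0.0, 0.0))
near = lsh.candidates((0.0, 0.0))
```

An affinity computation built on top of such an index would compare only points that land in the same bucket instead of all pairs, trading some recall for far fewer distance evaluations.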

        andrew.musselman Andrew Musselman added a comment -

        I'd say start with something commonly used, like vectors.

        Please make a pull request as soon as you can so we can look at actual code rather than just concepts, then develop from there.

        kanjilal Saikat Kanjilal added a comment - - edited

        So I did some more research and have some questions:

        1) Are we going to deal with images or text data to start?
        2) What do we really mean by a data point? In my mind it's represented by an (x, y) pair.
        3) I think the similarity measure used for locality-sensitive hashing should be configurable; namely, we should be able to plug in Jaccard, Euclidean, or cosine similarities as functions to be computed

        I have a sample locality-sensitive hashing scheme coded up in Scala, but I want clarification on the above before I proceed further

        Thanks for your help
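One way to read the "configurable measure" idea in point 3 is to pass the measure as a function. The sketch below is only an illustration in plain Python; the registry `MEASURES` and the function `pairwise` are hypothetical names, not part of Mahout. Note that Euclidean is a distance (smaller means closer) while cosine and Jaccard are similarities (larger means closer), so a real API would have to state which convention it expects.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two numeric vectors (requires Python 3.8+)."""
    return math.dist(a, b)


def cosine_similarity(a, b):
    """Cosine of the angle between two numeric vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Size of intersection over size of union, treating inputs as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical registry so the measure can be chosen by name in configuration.
MEASURES = {
    "euclidean": euclidean_distance,
    "cosine": cosine_similarity,
    "jaccard": jaccard_similarity,
}


def pairwise(items, measure):
    """Apply the named measure to every ordered pair of items."""
    f = MEASURES[measure]
    return [[f(x, y) for y in items] for x in items]


cos = pairwise([(1.0, 0.0), (0.0, 1.0)], "cosine")
jac = jaccard_similarity({1, 2, 3}, {2, 3, 4})
```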

        magsol Shannon Quinn added a comment -

        It's a good place to start. But yes, you'll need to replace all the SciPy/NumPy/scikit dependencies, which unfortunately account for a lot of the base functionality.

        kanjilal Saikat Kanjilal added a comment -

        Shannon, as a first cut I ran the pyscala utility to convert the Python code to Scala and am attaching the file here. I will now begin converting and cleaning up this code; please attach any feedback to this JIRA ticket.

        kanjilal Saikat Kanjilal added a comment -

        Here's my plan:
        1) Read through your python-spark implementation
        2) Come up with a design and attach it to this JIRA ticket
        3) Once we agree on the design, start coding with the Mahout DSL and write unit tests as well

        Sound ok?

        magsol Shannon Quinn added a comment -

        Sent you an email, but short story is I'd love some help--working on finishing my PhD by November

        kanjilal Saikat Kanjilal added a comment -

        Shannon, I'd like to help with this issue; would you mind if I start working on it?


          People

          • Assignee: Shannon Quinn (magsol)
          • Reporter: Shannon Quinn (magsol)
          • Votes: 0
          • Watchers: 4
