MAHOUT-363

Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3
    • Fix Version/s: 0.4
    • Component/s: Clustering
    • Labels:

      Description

      Proposal Title: EigenCuts spectral clustering implementation on map/reduce for Apache Mahout (addresses issue MAHOUT-328)

      Student Name: Shannon Quinn

      Student E-mail: magsol@gmail.com

      Organization/Project:

      Assigned Mentor:

      Proposal Abstract:
      Clustering algorithms are advantageous when the number of classes is not known a priori. However, most techniques still require an explicit K to be chosen, and most spectral algorithms' use of piecewise constant approximations of eigenvectors breaks down when the clusters are tightly coupled. EigenCuts [1] solves both problems by choosing an eigenvector to create a new cluster boundary and iterating until no more edges are cut.

      Detailed Description

      Clustering techniques are extremely useful unsupervised methods, particularly within my field of computational biology, for situations where the number (and often the characteristics as well) of classes expressed in the data are not known a priori. K-means is a classic technique which, given some K, assigns each data point to the cluster whose mean is nearest under some distance measure (e.g. Euclidean), then recomputes each cluster's mean, iterating to convergence.
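      The k-means loop just described can be sketched in a few lines. This is a hedged toy illustration on one-dimensional data in pure Python (with naive first-k initialization), not Mahout's implementation:

```python
import random

def kmeans(points, k, iters=20):
    """Assign each point to its nearest mean, then recompute the means; repeat."""
    means = list(points[:k])  # naive initialization: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        # Empty clusters keep their old mean.
        means = [sum(c) / len(c) if c else means[i]
                 for i, c in enumerate(clusters)]
    return sorted(means)

random.seed(0)
# Two well-separated Gaussian blobs around 0 and 10.
data = ([random.gauss(0, 0.5) for _ in range(50)]
        + [random.gauss(10, 0.5) for _ in range(50)])
means = kmeans(data, 2)
```

Even from a poor initialization (both starting means drawn from the first blob), the alternation of assignment and mean-update steps pulls the two means apart toward the blob centers.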

      Another approach is spectral clustering, which models the data as a weighted, undirected graph in some n-dimensional space, and creates a matrix M of transition probabilities between nodes. By computing the eigenvalues and eigenvectors of this matrix, most spectral clustering techniques take advantage of the fact that, for data with loosely coupled clusters, the K leading eigenvectors will identify the roughly piecewise constant regions in the data that correspond to clusters.
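      The construction of M from pairwise affinities can be sketched as follows (toy affinity values are my own; pure Python):

```python
def transition_matrix(A):
    """Row-normalize an affinity matrix so each row sums to 1,
    giving Markov transition probabilities between nodes."""
    return [[a / sum(row) for a in row] for row in A]

# Toy graph: two loosely coupled pairs of nodes (strong within-pair
# affinity 1.0, weak cross-pair affinity 0.1).
A = [[0.0, 1.0, 0.1, 0.0],
     [1.0, 0.0, 0.0, 0.1],
     [0.1, 0.0, 0.0, 1.0],
     [0.0, 0.1, 1.0, 0.0]]

M = transition_matrix(A)
```

The eigenvectors of M (or of a symmetrized variant of it) are what the leading-K heuristic inspects for roughly piecewise constant regions.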

      However, these techniques all suffer from drawbacks, the two most significant being the need to choose an arbitrary K a priori, and the breakdown of the piecewise constant approximation on the eigenvectors when the clusters are tightly coupled.

      The EigenCuts algorithm addresses both these issues. As a spectral clustering algorithm, it works by constructing a Markov chain representation of the data and computing the eigenvectors and eigenvalues of the transition matrix. Eigenflows, the flows of probability associated with each eigenvector, each have a characteristic half-life of flow decay. By perturbing the weights between nodes, it can be observed where bottlenecks exist in an eigenflow's half-life, allowing for the identification of boundaries between clusters. Thus, the algorithm iterates until no more cuts between clusters need to be made, eliminating the need for an a priori K and conferring the ability to separate tightly coupled clusters.
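      As a rough numerical illustration of the half-life notion (my own sketch, not the paper's exact derivation): a mode whose amplitude shrinks by a factor λ per Markov step reaches half strength after t = ln(1/2)/ln(λ) steps, so eigenvalues near 1 correspond to long-lived, slow-mixing eigenflows:

```python
import math

def half_life(lam):
    """Steps for a mode decaying by factor lam per step (0 < lam < 1)
    to reach half strength: lam**t = 0.5  =>  t = ln(0.5) / ln(lam)."""
    return math.log(0.5) / math.log(lam)

print(half_life(0.5))   # decays to half strength in exactly one step
print(half_life(0.99))  # slow-mixing mode: far longer half-life
```

It is these per-mode half-lives, and how they respond to small weight perturbations, that EigenCuts probes to locate bottleneck edges.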

      The only disadvantage of EigenCuts is the need to recompute eigenvectors and eigenvalues at each iterative step, incurring a large computational overhead. This problem can be adequately addressed within the map/reduce framework and on a Hadoop cluster by parallelizing the computation of each eigenvector and its associated eigenvalue. Apache Hama in particular, with its specializations in graph and matrix data, will be crucial in parallelizing the computations of transition matrices and their corresponding eigenvalues and eigenvectors at each iteration.
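      The per-eigenvector computation being parallelized can be pictured as repeated matrix-vector products, as in power iteration; in the distributed setting each product below would be its own map/reduce pass over the rows of the matrix (a hedged sketch under that assumption, pure Python):

```python
def power_iteration(M, iters=100):
    """Estimate the dominant eigenvalue/eigenvector of M by repeatedly
    multiplying a vector through the matrix and rescaling."""
    n = len(M)
    v = [1.0 / n] * n
    lam = 0.0
    for _ in range(iters):
        # Each row's dot product is independent: this is the step that
        # map/reduce distributes across the cluster.
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(abs(x) for x in w)  # rescale factor -> dominant |eigenvalue|
        v = [x / lam for x in w]
    return lam, v

lam, v = power_iteration([[2.0, 0.0], [0.0, 1.0]])
```

Mahout's actual distributed eigensolver uses the Lanczos algorithm rather than plain power iteration, but the cost structure is the same: one distributed matrix-vector product per step, repeated at every EigenCuts iteration.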

      Since Dr Chennubhotla is currently a member of the faculty at the University of Pittsburgh, I have been in contact with him for the past few weeks, and we both envision and eagerly anticipate continued collaboration over the course of the summer and this project's implementation. His guidance in highlighting the finer points of the underlying theory, coupled with my experience in and knowledge of software engineering, makes this a project we are both extremely excited about implementing.

      Timeline

      At the end of each sprint, there should be a concrete, functional deliverable. It may not do much, but what it does will work and will be accompanied by unit tests with full coverage.

      Sprint 0 (April 26 - May 23): Work with mentor on any pre-coding tasks - familiarization with and dev deployments of Hadoop and Mahout; reading up on documentation, fine-tuning the project plan and requirements. This part will kick into high gear after May 6 (my last final exam and final academic obligation, prior to the actual graduation ceremony), but likely nothing before April 29 (the day of my thesis defense).

      Sprint 1 (2 weeks; May 24 - June 6): Implement basic k-means clustering algorithm and parallelize on Mahout over Hadoop. Preliminary interface allows for dataset selection and visualization of resulting clusters.

      Sprint 2 (3 weeks; June 7 - 27): Modify k-means algorithm to spectral clustering. Integrate map/reduce framework via Mahout and take advantage of existing core calculation of transition matrices and associated eigenvectors and eigenvalues.

      Sprint 3 (3 weeks; June 28 - July 18): Augment spectral clustering to EigenCuts. Fully parallelize with Mahout. Also, mid-term evaluations.

      Sprint 4 (3 weeks; July 19 - August 8): Fix any remaining issues with EigenCuts. Finalize interface for running the algorithm, selecting datasets and visualizing results.

      Sprint 5 (1 week; August 9 - 15): Tidy up documentation, write final unit tests, fix outstanding bugs.

      Other Information

      I am finishing up my last semester as a master's student in computational biology at Carnegie Mellon, prior to beginning the PhD program in CompBio at Carnegie Mellon this coming fall. I have worked extensively with clustering techniques as a master's student, as a significant amount of my work has involved bioimage analysis within Dr Robert Murphy's lab. Previous work has involved using HMMs to detect patterns in tuberculosis genomes and using CLUSTALW to cluster those patterns according to geographic location. My master's thesis involves using matrix completion to infer linear transformations of proteins and their associated subcellular locations across varying cell conditions (drugs, cell lines, etc).

      I unfortunately have limited experience with Apache Mahout and Hadoop, but with an undergraduate computer science degree from Georgia Tech, and after an internship with IBM Extreme Blue, I feel I am extremely adept at picking up new frameworks quickly.

      References

      [1] Chakra Chennubhotla and Allan D. Jepson. Half-Lives of EigenFlows for Spectral Clustering. NIPS 2002.

      1. MAHOUT-363JE.patch
        132 kB
        Jeff Eastman
      2. MAHOUT-363JE.patch
        131 kB
        Jeff Eastman
      3. MAHOUT-363.patch
        20 kB
        Shannon Quinn
      4. MAHOUT-363.patch
        19 kB
        Shannon Quinn
      5. MAHOUT-363.patch
        19 kB
        Shannon Quinn
      6. MAHOUT-363.patch
        20 kB
        Shannon Quinn
      7. MAHOUT-363.patch
        23 kB
        Shannon Quinn
      8. MAHOUT-363.patch
        26 kB
        Shannon Quinn
      9. MAHOUT-363.patch
        27 kB
        Shannon Quinn
      10. MAHOUT-363.patch
        27 kB
        Shannon Quinn
      11. MAHOUT-363.patch
        38 kB
        Shannon Quinn
      12. MAHOUT-363.patch
        39 kB
        Shannon Quinn
      13. MAHOUT-363.patch
        50 kB
        Shannon Quinn
      14. MAHOUT-363.patch
        70 kB
        Shannon Quinn
      15. MAHOUT-363.patch
        76 kB
        Shannon Quinn
      16. MAHOUT-363.patch
        79 kB
        Shannon Quinn
      17. MAHOUT-363.patch
        87 kB
        Shannon Quinn
      18. MAHOUT-363.patch
        87 kB
        Shannon Quinn
      19. MAHOUT-363.patch
        94 kB
        Shannon Quinn
      20. MAHOUT-363.patch
        130 kB
        Shannon Quinn
      21. MAHOUT-363.patch
        134 kB
        Shannon Quinn
      22. MAHOUT-363.patch
        137 kB
        Shannon Quinn
      23. MAHOUT-363.patch
        137 kB
        Shannon Quinn
      24. MAHOUT-363.patch
        132 kB
        Shannon Quinn

        Activity

        Jeff Eastman added a comment -

        Created follow-on issues MAHOUT-516, MAHOUT-517, MAHOUT-518 and MAHOUT-519 so closing this issue.

        Hudson added a comment -

        Integrated in Mahout-Quality #354 (See https://hudson.apache.org/hudson/job/Mahout-Quality/354/)
        MAHOUT-363: missed one class in last commit

        Jeff Eastman added a comment -

        I may have jumped the gun a bit on the commit, but in retrospect I'm happy with the outcome for the following reasons:
        1. It is a substantial, very interesting contribution which is well-written, well-documented and well-tested. A great GSoC outcome.
        2. Committing now allows me and others to more easily collaborate on the final polishing and integration with the other clustering jobs

        • outputting values from Eigencuts. I'd suggest doing something consistent with KMeans for now. Classification convergence is down the road.
        • input preprocessing to produce the affinity tuples, or perhaps A directly (can we do something simple with DistanceMeasures?)
        • CLI integration with k-means optional arguments (Spectral KMeans)
        • Identifying, characterizing and fixing the Eigencuts logic error that you report (can you post a patch with a unit test that surfaces it?)
        3. More eyes on the code will improve its rate of convergence towards fully production-ready code.
        4. In trunk, the code will automatically be kept up-to-date as other changes (e.g. to DRM, PMD/Checkstyle) come in. Less busy work for you.
        5. We have another couple of weeks of polishing time before 0.4 actually ships.
        6. I hate going backwards. Pedal to the metal is more my style.

        Given the above, I'd like to close this issue and open new ones for the work items in 2 so they can be tracked separately

        Shannon Quinn added a comment -

        (apologies for the delay)

        While I would love to have this patch included in 0.4, I'm not convinced it's ready. There is no form of output just yet; it was something I purposely held back on, since there was so much discussion surrounding the standardization of Mahout's classification/clustering formats. Rather than put in something along the lines of KMeans that would have been pulled out anyway, I essentially left it blank.

        Also, there's still a logic bug somewhere in the Eigencuts algorithm; the clustering isn't yet correct. I'm rotating this semester with Dr. Chennubhotla, and this is one of our priorities.

        I will certainly yield to and be 100% behind whatever the community thinks on this issue, but my $0.02 is that pushing it back to 0.5 might be better.

        Jeff Eastman added a comment -

        This patch changes the TestVectorCache test to use getTestTempDirPath("output") instead of relying upon hard-wired output. It caused the PFPGrowthTest to fail for a reason that I cannot yet identify. That test also relies upon "output" but I was unsuccessful in changing it to use getTestTempDirPath. I will open a JIRA for that as finding the dependency problem was most frustrating.

        This patch compiles and all tests run.

        Jeff Eastman added a comment -

        This is so close. I'd really like to stretch to get it into 0.4.

        Jeff Eastman added a comment -

        This looks ready for trunk and I'd like to see it in 0.4. It's well-written, passes all its tests and I'm prepared to help support it. Waiting for a nod from Shannon.

        Jeff Eastman added a comment -

        This patch compiles and runs all of its unit tests. The PFPGrowthTest seems to bomb when I run mvn clean install but it does not fail when run individually from Eclipse. It looks to be well written code and I'd be pleased to commit it if you say it's ready for trunk. We can work out dotting the i's and crossing the t's from there. Nice work.

        Sean Owen added a comment -

        Shannon was this completed? Do we have a final patch to be committed?

        Shannon Quinn added a comment -

        I like that idea very much. However, this brings up a point of discussion: this algorithm (spectral k-means and Eigencuts itself) operates on data affinities, rather than the raw data itself. As such, we'd need to preprocess the Reuters dataset prior to running the algorithm on it, and this preprocessing doesn't have a cut-and-dried formula, so we need to decide how to produce scalar pairwise comparisons between all the documents. Isabel suggested taking the cosine similarity of the documents' vectors and using a threshold to sparsify the data.
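        Isabel's suggestion could look something like this (the toy term-count vectors and the threshold value are my own illustration; pure Python):

```python
import math

def cosine(u, v):
    """Cosine similarity between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def affinities(docs, threshold):
    """Pairwise cosine affinities, zeroing entries below the
    threshold to sparsify the resulting matrix."""
    n = len(docs)
    return [[cosine(docs[i], docs[j])
             if cosine(docs[i], docs[j]) >= threshold else 0.0
             for j in range(n)] for i in range(n)]

docs = [[1, 0, 1], [1, 0, 0], [0, 1, 0]]  # hypothetical term counts
A = affinities(docs, threshold=0.5)
```

The thresholded matrix A is then in the affinity form the spectral jobs expect, with dissimilar document pairs contributing no edge at all.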

        This also brings up another point that was sidelined earlier in the summer in favor of simply completing the algorithm: since there are different ways of generating affinities (which are, as we see here, also data-specific), we could either have the user generate the affinities themselves (which is currently how it's done), or provide a sort of plugin architecture and/or pre-made affinity generation mechanisms so the user could still feed in the raw data. I would be happy to work on this post-GSoC (along with continued Eigencuts support, and in any other capacity with Mahout in general).

        #1 is much more important to the task at hand than #2, but still something to think about.

        Robin Anil added a comment -

        Hi Shannon, could you create a demo example using the Reuters dataset? It will be easy to compare the clusters produced by spectral clustering vs. the rest of the clustering algorithms.

        Robin

        Shannon Quinn added a comment -

        A bug-fix patch. Will enumerate.

        Shannon Quinn added a comment -

        Final major commit. All unit tests are in place, and all tests pass for full Eigencuts clustering algorithm. Possible minor fixes to follow.

        Shannon Quinn added a comment -

        Almost finished; simply need to do the manual sensitivity calculations so as to be able to check the answers of the algorithm itself. Other than this, all tests are implemented and all tests pass.

        Shannon Quinn added a comment -

        All tests are passing; only a few more tests need to be finished at this point.

        Shannon Quinn added a comment -

        Almost all the unit tests are fully written; however a few are still failing (most notably the sensitivity calculations; they're rather complex and I'm having difficulty tracking down where my mistakes are).

        Shannon Quinn added a comment -

        This patch contains all requisite functionality as per the overall project proposal. It is UNTESTED, so there is a 99.9999% chance it will NOT work (or at least, not how it's supposed to). Full-scale testing will commence immediately to iron out the bugs and make sure unit test coverage is adequate.

        Shannon Quinn added a comment -

        Major refactoring underway in the data structures; turns out my previous assessment that all sensitivities could be captured in an n-length Vector was incorrect. The number of values will be somewhere between n and n^2, hence they need to be organized in a DRM format, and I will be implementing a scheme similar to that used in DRM.times() to synchronize which rows are accessed simultaneously in the M/R task.

        Drivers have been refactored to account for the new VectorMatrixMultiplicationJob (replacing the previous chained DRM.times() calls); also moved several read/write operations on Vectors in the distributed cache into central load() and save() utilities.

        Much more work yet to be done, but the finish line is in sight.

        Shannon Quinn added a comment -

        Finished the VectorMatrix multiplication job, and have begun on the final AffinityCuts job which performs the clustering in an iterative fashion. I will need to refactor the output of the SensitivityReducer to be a single vector (likely an ArrayWritable), as I will need the AffinityCuts job to have access to this vector when mapping across the rows of the affinity DRM; trying to access two DRMs simultaneously within a Mapper - in particular, making sure the kth row of the affinity matrix was also the kth row of the "cut" matrix - just seemed too unwieldy. Plus, if the affinity matrix is n-by-n, the "cut" vector will have, at most, n elements.

        Also need to refactor the Driver files to take advantage of the new VectorMatrix multiplication job.

        Shannon Quinn added a comment -

        Two of the three final jobs are complete from an M/R perspective: the sensitivity Reducer and the vector/matrix multiplication job. These two tasks still need their setup/tear-down details worked out - setting up Configuration parameters, setting the Jobs, etc - but the map/reduce methods are fully implemented. The only full task remaining is implementation of the "cut" job on the affinity matrix.

        Shannon Quinn added a comment -

        One of three final M/R tasks almost complete: sensitivity calculation still needs to have its Reducer implemented (will output only a single dense vector). Two other M/R tasks remain: one to compute the product of a vector with a DRM (and vice versa), and the other to perform the final step of the Eigencuts clustering algorithm - to induce "cuts" in the affinity matrix. Also, still need to refactor the drivers for Eigencuts and SpectralKMeans using the new matrix diagonalization job, which emits a single vector rather than a full DRM (hence the need for a straight vector-DRM multiplication job).

        Shannon Quinn added a comment -

        Much of the sensitivity calculation is implemented, but untested. There is no way to get around an n^2 computation (as seen in the SensitivityMapper's map() method), but at most n values will be retained and sent off to the Reducer. Still need to test that the HDFS cache is working properly for the two constant vectors that are needed, and also need to refactor the Driver file with a new map/red task that creates a vector in the matrix diagonalization process, rather than a full DRM. Will also need to create a map/red job for multiplying that vector with a DRM.

        Shannon Quinn added a comment -

        Refactored overall structure, made the jobs shared by the two algorithms more generic. Eigencuts (final iteration) now performs all computations that first iteration algorithm did. Still need to implement Eigencuts-specific M/R tasks (edge weight perturbation analysis; non-maximal suppression; edge-cutting). Also need to figure out how many eigenvectors to select; for k-means it was fairly straightforward to use the heuristic of K as the number of eigenvectors to select. Since Eigencuts does not need to specify a number of clusters, it is less obvious how to proceed.

        Shannon Quinn added a comment -

        Tweaked SpectralKMeans slightly, now produces correct clusterings (provided input data is correctly formatted). Eigencuts driver begun, still hashing out how to use the M/R tasks implemented in SpectralKMeans. Perhaps turn these into generic utilities?

        Shannon Quinn added a comment -

        First full patch. Contains all of iteration 1, plus testing framework and outline of iteration 2. Also included the diff for the driver file, allowing this patch to be applied and the programs to be executed as is. Still working out the details for iteration 2.

        Shannon Quinn added a comment -

        Final sprint 1-2 commit. Everything appears to be functioning correctly. Will begin sprint 3, though may still apply fixes to sprint 1-2 code.

        Shannon Quinn added a comment -

        Everything seems to work correctly. Need to determine output format for original clustering assignments, and also why only (k - 1) unique clustering labels seem to be assigned for any given k. Will update.

        Shannon Quinn added a comment -

        Have determined that the VerificationJob is creating a blank SequenceFile of the results; parameters (error tolerance and minEigenValue) don't seem to be the cause.

        Shannon Quinn added a comment -

Off-by-1 indexing errors persist after the Lanczos verification job.

        Shannon Quinn added a comment -

        Temporary modification to DistributedRowMatrix to allow for multiple .times() calls without conflicting Paths. IndexOutOfBoundsException received when attempting to generate random cluster centroids, seems to be an off-by-1 issue. Other than that, everything is in place for sprint 1.

        Isabel Drost-Fromm added a comment -

        Assigning to Shannon as he is doing most of the implementation work.

        Shannon Quinn added a comment -

        All necessary functionality is in place, but testing has not completed due to exception in Driver where L is calculated by matrix multiplication. Temp Paths used for DistributedRowMatrix.times() are conflicting.

        Shannon Quinn added a comment -

        Fixed the IOException resulting from failing to properly override superclass Reducer's reduce() method in EigencutsInputReducer. Patch now runs until System.exit(0) call in Driver file.
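
To illustrate the override pitfall described above: the sketch below (class names are hypothetical stand-ins, not the actual Hadoop or Mahout sources) shows how a reduce() method whose parameter types don't match the superclass generics silently *overloads* instead of overriding, so the superclass default runs at reduce time; annotating with @Override makes the compiler catch the mismatch.

```java
import java.util.Arrays;
import java.util.Iterator;

// Minimal stand-in for a Hadoop-style Reducer base class (illustrative only).
class Reducer<K, V> {
    // Default behavior: identity-like pass-through, analogous to the
    // framework default that runs when a subclass fails to override reduce().
    public String reduce(K key, Iterator<V> values) {
        StringBuilder sb = new StringBuilder(key.toString());
        while (values.hasNext()) {
            sb.append(' ').append(values.next());
        }
        return sb.toString();
    }
}

// Hypothetical sketch of the corrected reducer: the parameter types match
// the superclass type arguments exactly, so this genuinely overrides.
class EigencutsInputReducerSketch extends Reducer<String, Integer> {
    // Without @Override, a method taking e.g. (Object, Iterator) would
    // compile as a separate overload and never be dispatched to.
    @Override
    public String reduce(String key, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return key + "=" + sum;
    }
}

public class OverrideDemo {
    public static void main(String[] args) {
        Reducer<String, Integer> r = new EigencutsInputReducerSketch();
        // Dynamic dispatch reaches the subclass because the signature matches.
        System.out.println(r.reduce("row0", Arrays.asList(1, 2, 3).iterator())); // prints "row0=6"
    }
}
```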

        Shannon Quinn added a comment -

Implemented new input format, finished M/R process for reading the input and creating DistributedRowMatrix. However, am receiving a strange ClassCastException at runtime; details in EigencutsInputMapper.java.

        Shannon Quinn added a comment -

        First attempt at implementing Sprint 1: generic k-means spectral clustering. Lots of the details are wrong; lots of comments explaining my thoughts at various steps. This patch is primarily for community feedback, as it compiles but definitely does not work correctly. Thanks.

        Shannon Quinn added a comment -

I've been getting a lot of good feedback, and after discussing my proposal at length with a few people, I'll be submitting changes to the proposal timeline, particularly regarding the k-means implementation.

Essentially, I'll cut out what is currently sprint 1, shifting sprints 2-4 back two weeks. I'll start immediately on implementing the EigenCuts algorithm, relying heavily on test-driven development and full-coverage unit testing to highlight any bugs in the EigenCuts implementation. Using the extra two weeks, I'll focus on providing more thorough documentation, including detailed specs on how to use the algorithm: the PhD students and professors I am working with at CMU are interested in a map/reduce implementation of this algorithm and have massive datasets readily available for use. I'll post an example dataset, along with instructions on how to feed it to Mahout, and what the output should look like.

        If there is extra time at the end, I can work on the user interface and data visualization.

        The schedule aside, there are already a few points of the implementation that I can probably address in slightly greater detail. Regarding the decomposition of the similarity matrix and computation of eigenvectors, this is the computationally difficult part and can be done several different ways. A stochastic decomposition, as proposed in http://arxiv.org/abs/0909.4061v1 (via Ted Dunning) would be one of several approaches, though from a performance perspective this looks to be one of the best. For the mechanics of parallelizing EigenCuts, the paper from ECML/PKDD 2008, "Parallel spectral clustering" (via Isabel Drost) provides an excellent starting point, in addition to using Mahout's existing map/reduce eigensolver.
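
Since the eigendecomposition is the computationally heavy step, here is a minimal, purely illustrative single-machine sketch of power iteration, the simplest way to extract a leading eigenvector. Mahout's distributed Lanczos solver and the randomized decomposition cited above are what would actually be used at scale; this only shows the core idea of repeated matrix-vector products.

```java
// Illustrative only: power iteration for the leading eigenvector of a
// small symmetric similarity matrix. All names here are hypothetical.
public class PowerIterationSketch {
    static double[] leadingEigenvector(double[][] a, int iters) {
        int n = a.length;
        double[] v = new double[n];
        java.util.Arrays.fill(v, 1.0 / Math.sqrt(n));    // arbitrary unit start vector
        for (int t = 0; t < iters; t++) {
            double[] w = new double[n];
            for (int i = 0; i < n; i++)                  // w = A * v
                for (int j = 0; j < n; j++)
                    w[i] += a[i][j] * v[j];
            double norm = 0;
            for (double x : w) norm += x * x;
            norm = Math.sqrt(norm);
            for (int i = 0; i < n; i++) v[i] = w[i] / norm;  // renormalize
        }
        return v;
    }

    public static void main(String[] args) {
        // Toy similarity matrix: points 0 and 1 are tightly coupled,
        // point 2 only weakly attached.
        double[][] a = {
            {1.0, 0.9, 0.0},
            {0.9, 1.0, 0.1},
            {0.0, 0.1, 1.0}
        };
        double[] v = leadingEigenvector(a, 200);
        System.out.printf("%.3f %.3f %.3f%n", v[0], v[1], v[2]);
    }
}
```

Each iteration amplifies the component of v along the dominant eigenvector, which is why the per-iteration work reduces to a matrix-vector product — exactly the primitive that map/reduce parallelizes well.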

        I hope this helps clarify some points and strengthen the overall feasibility of the proposal.

        Ted Dunning added a comment -

        Are there any other suggestions for making the proposal more viable?

        I think that the proposal as it stands is very strong.

When and if it comes down to doing the work, I will be asking questions like whether it is possible to take several eigenvectors at a time instead of just one, and how this work relates to the stochastic decomposition work described at http://arxiv.org/abs/0909.4061v1 . In particular, I wonder if you could re-use the randomized product in doing subsequent decompositions.

        But these kinds of questions aren't necessarily the kind of thing that you need to answer in the proposal ... they are just how you get side-tracked after you start.

        Shannon Quinn added a comment -

        I apologize for that. I actually only changed some of the wording; the overall proposal structure wasn't changed. But I will certainly refrain from editing the ticket itself.

        Are there any other suggestions for making the proposal more viable?

        Jake Mannix added a comment -

        If possible, Shannon, if you could simply add comments to this JIRA ticket, instead of editing the original ticket itself, we'll be able to more easily follow your thinking. Otherwise, we can't really see what has changed.

        Shannon Quinn added a comment -

        Hi Robin, thanks for the suggestions! I have started playing with the code (though I should read the wiki more, excellent point), and my rationale behind the k-means implementation was to build an initial framework for the EigenCuts algorithm, around which I could build a UI that accepted input and displayed output, and also to tap into the map/reduce framework. Both would work alongside a very simple clustering technique that would be easily implemented. Essentially, the goal of the first sprint is to get the UI and map/reduce integration running; the k-means algorithm is merely a sort of test condition for those two features, given its ease of implementation.

That's just my explanation; if you feel otherwise I'm happy to adjust my proposal.

        Robin Anil added a comment -

Hi Shannon, did you take time to explore the Mahout code? I believe the k-means you are looking to implement is already there; that will shave two weeks off your GSoC. Reading the code/wiki is a great exercise that will help you be more realistic in your proposal.

        Jake Mannix added a comment -

        ... and actually, there is no need for Hama, as distributed matrix computations can already be done in Mahout natively (in particular: finding eigenvectors and eigenvalues is already done as a Map/Reduce job).

This proposal seems to be at exactly the right level of additional work and utilization of our current set of primitive operations for a GSoC project. I wish I had the time to help with mentoring this project, in fact.

        Shannon Quinn added a comment -

        Thank you for the feedback!

        Cutting out Hama would certainly improve the feasibility of the project timeline and allow me to further refine the overall algorithm. I will absolutely adhere to your advice; I'll edit this ticket and my GSoC application. Thank you again!

        Ted Dunning added a comment -

        Nice proposal. Well written and well conceived.

One tiny suggestion is to omit Hama from your plan, as it would just be a distraction for you. The Hama project is pretty much independent of Mahout and there hasn't been any contribution in the H->M direction.


  People

  • Assignee: Shannon Quinn
  • Reporter: Shannon Quinn