MAHOUT-524: DisplaySpectralKMeans example fails

      Description

      I've committed a new display example that attempts to push the standard mixture of models data set through spectral k-means. After some tweaking of configuration arguments and a bug fix in EigenCleanupJob it runs spectral k-means to completion. The display example is expecting 2-d clustered points and the example is producing 5-d points. Additional I/O work is needed before this will play with the rest of the clustering algorithms.

      Attachments

      1. aff.txt
        2.38 MB
        Shannon Quinn
      2. raw.txt
        14 kB
        Shannon Quinn
      3. spectralkmeans.png
        43 kB
        Shannon Quinn
      4. EclipseLog_20110918.txt
        14 kB
        Lance Norskog
      5. SpectralKMeans_fail_20110919.txt
        6 kB
        Lance Norskog
      6. MAHOUT-524.patch
        9 kB
        Grant Ingersoll
      7. MAHOUT-524.patch
        5 kB
        Grant Ingersoll
      8. ASF.LICENSE.NOT.GRANTED--screenshot-1.jpg
        46 kB
        Grant Ingersoll
      9. MAHOUT-524.patch
        13 kB
        Shannon Quinn
      There are no Sub-Tasks for this issue.

        Activity

        Dan Brickley added a comment -

        Shannon informs me I'm getting this error because node IDs must be counted from zero. I've updated the wiki to say this more explicitly. So this JIRA can stay closed, phew.
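The zero-indexing pitfall above is easy to trip over before a long job run. As a sketch, a pre-flight check on an "i,j,value" affinity CSV could look like this (the helper name is hypothetical, not part of Mahout; it assumes the triple format the spectralkmeans job consumes):

```python
# Hypothetical pre-flight check for an affinity file of "i,j,value" triples.
# Node IDs must be counted from zero; returns the dimensionality for -d.
def check_affinity_ids(lines):
    ids = set()
    for line in lines:
        i, j, _ = line.strip().split(",")
        ids.update((int(i), int(j)))
    if min(ids) != 0:
        raise ValueError("node IDs must be counted from zero (min id is %d)" % min(ids))
    return max(ids) + 1  # value to pass as -d

print(check_affinity_ids(["0,1,0.5", "1,0,0.5", "1,2,0.8", "2,1,0.8"]))  # 3
```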

        Dan Brickley added a comment -

        I just tried spectral k-means with some wikipedia/dbpedia data (1.0 affinities for every page and topic category URL pair in the Wiki). Data came from http://downloads.dbpedia.org/3.7/en/article_categories_en.nt.bz2 and is dropped on the Web at http://danbri.org/2012/spectral/dbpedia/ (I posted the .csv plus an int-to-URL dictionary file).

        My best guess at a command line (running this with today's trunk plus a fresh 0.20.203.0 Hadoop pseudo-cluster) was this:

        mahout spectralkmeans -i wiki/ -o output1 -k 20 -d 4192499 --maxIter 10 (where hdfs wiki/ subdir contains the .csv data file)

        Unfortunately I'm hitting one of the various problems discussed above. If anyone else can reproduce this, perhaps a fresh JIRA is needed.

        It gets stuck after the first job, with an essentially empty seqfile. Full transcript here: https://gist.github.com/1804016

        (checked with "mahout seqdumper --seqFile output1/calculations/diagonal/part-r-00000")

        This is essentially the same experience I had back in Sept (see above) running a similar test.

        Hudson added a comment -

        Integrated in Mahout-Quality #1279 (See https://builds.apache.org/job/Mahout-Quality/1279/)
        MAHOUT-524: committing patch since Shannon has no internet. All tests run

        jeastman : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1225596
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/spectral/common/VectorMatrixMultiplicationJob.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/spectral/kmeans/SpectralKMeansDriver.java
        • /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/DisplayClustering.java
        • /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/DisplaySpectralKMeans.java
        • /mahout/trunk/math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
        • /mahout/trunk/math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosState.java
        Jeff Eastman added a comment -

        Shannon, is this patch ready to commit? I've installed it and verified that DisplaySpectralKMeans is indeed finding clusters. By increasing the numClusters from 2 to 3 it now does a credible job of finding the 3 clusters present in the generated data.

        Dan Brickley added a comment -

        Great to see this getting wrapped up. Can you suggest what commandline(s) and test input others might try to verify this?

        I have some py-generated afftest.txt left from previous investigations but forget its exact origins.

        I also have some real world similarity data with labeled items; how would I use those?

        Shannon Quinn added a comment -

        This patch includes the contents of Grant's Nov 11 patch, which fixes the "/tmp/data" path error. It also adds a few methods to DisplayClustering.java to color the points according to their cluster assignments, rather than drawing ellipses around the cluster centroids. Finally, it includes minor edits to SKMD's output format.
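The color-by-assignment idea can be sketched independently of DisplayClustering. The palette and function names below are illustrative only, not the patch's actual code:

```python
# Map each point to a color determined by its cluster assignment, cycling a
# fixed palette; this is the alternative to drawing ellipses around centroids.
PALETTE = ["red", "green", "blue", "orange", "purple"]

def color_for(cluster_id):
    return PALETTE[cluster_id % len(PALETTE)]

def colored_points(points, assignments):
    # pair each point with the color of its assigned cluster
    return [(point, color_for(cluster)) for point, cluster in zip(points, assignments)]

# e.g. colored_points([(0.1, 0.2), (3.0, 3.1)], [0, 1])
```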

        Shannon Quinn added a comment -

        I believe you and Rozemary need to apply the patch that is attached to this issue to get past the "tmp/data" error. It stems from the Lanczos solver, but is likely a symptom of being called by SKM incorrectly.

        I'm still working on the patch for this, will hopefully be done soon...

        Kevin Findlay added a comment -

        Slightly confused: I have checked that the MAHOUT-524 patches are included in my current build of the trunk.

        However, I still get the file-not-found error on "tmp/data" as described in the sub-task.

        Have I got the versions right?

        Rozemary Scarlat added a comment -

        Hi! I am new to Mahout and I have been trying to use spectral k-means clustering, but I ran into the problem described in the comments above: the Lanczos solver tries to input the output of the VectorMatrixMultiplicationJob as a "calculations/laplacian-166/tmp/data" file, instead of the "calculations/laplacian-166/part-m-00000".
        I was wondering if there is currently a way to run spectral clustering.

        Shannon Quinn added a comment -

        Unknown. Still coding up a way of coloring the dots rather than drawing circles.

        Dan Brickley added a comment -

        (hmm this issue seems something of a proxy for general code rot and problems with the spectral piece of Mahout)

        Where are we with this? I see "a symptom of us calling the job wrong" and "throwing off the final results". Is the problem purely in the displaying of spectral k-means, or something deeper? E.g., if I want eigenvectors and eigenvalues of the Laplacian re-representation of an affinity matrix, is the underlying code in a happy state?

        Grant Ingersoll added a comment -

        If at all possible, my suggestion would be colored dots to indicate the clusters.

        There is no requirement that we have to draw circles or leverage the old code, we just need something that works.

        Shannon Quinn added a comment - edited

        After implementing the same code in Python, my suspicion is actually that the clusters from the K-means run at the conclusion of the spectral algorithm are throwing off the final results shown in DisplaySKM. Regular K-means is running on the spectral data (the top k eigenvectors of the affinities) rather than the original data. I don't know K-means well enough to know for sure, but my guess is that all the distance measurements that come back in its output format are relative to the spectral data rather than the original data. So what you see in the end-result graph are circles around where the spectral data are.

        That'd be my first guess, anyway. I'm working on a couple things to help with this: a sequential version of spectral k-means, and a job to read raw data (text format: whitespace or comma-separated n-dimensional points) and convert it to affinities (a la issue 518, finally!). Hopefully these will help diagnose spectral k-means.

        But if it is a data issue, I'm not sure how we can translate the distance measurements on the spectral data back onto the original data for the DisplaySKM code. I would argue, though, that since spectral k-means doesn't operate on the same GMM-type basis that regular K-means does, overlaying K gaussians isn't really what we want here, anyway. If at all possible, my suggestion would be colored dots to indicate the clusters.
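For reference, the pipeline described above (K-means on the rows of the top-k eigenvector matrix rather than on the original points) can be sketched in a few lines of numpy. This is a conceptual stand-in under my own assumptions (dense symmetric affinities, farthest-point initialization), not Mahout's implementation:

```python
import numpy as np

def spectral_kmeans(A, k, iters=20):
    """K-means on the 'spectral data': rows of the top-k eigenvector matrix."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = d_inv_sqrt @ A @ d_inv_sqrt               # normalized affinity matrix
    _, vecs = np.linalg.eigh(L)                   # eigenvalues ascending
    U = vecs[:, -k:]                              # one k-dim spectral point per node
    # farthest-point initialization, then plain Lloyd iterations on U
    centers = [U[0]]
    for _ in range(1, k):
        dists = np.min([((U - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((U[:, None, :] - centers[None]) ** 2).sum(axis=2), axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = U[labels == c].mean(axis=0)
    return labels
```

Note that every distance here is measured between spectral points, which is exactly why overlaying the resulting clusters on the original 2-d data misleads the display.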

        Grant Ingersoll added a comment -

        I applied your patch but I'm having trouble following where you fixed the Lanczos issue

        I put in a sanity check at:

        int size = ejCol.size();
        for (int j = 0; j < size; j++) {
          ...
        }

        so that we don't overrun the basis vector size.

        However, based on Jake's comments, I'd say that is a symptom of us calling the job wrong.

        Grant Ingersoll added a comment -

        I applied your patch but I'm having trouble following where you fixed the Lanczos issue (though from within Eclipse I'm getting OutOfMemory errors...).

        Yeah, up your heap to 1024M

        Jake Mannix added a comment -

        I don't really know anything about the way that SKMD works, so all I can weigh in on is what's going on in Lanczos:

        You take an input matrix with some number of rows (this number doesn't matter, doesn't show up anywhere) and numCols columns (this number matters a lot). You want desiredRank eigenvectors to pop out in the end. So you start with some initial basisVector (number 0), and you iterate again and again taking your input corpus.timesSquared(basisIminusOne) (resultant vector is of size numCols), do some orthogonalization against previous vectors, hang onto this vector.

        Eventually you have desiredRank basisVectors, arranged in the LanczosState object in a Map<Integer,Vector> (it could be a Matrix, certainly, but we're just hanging onto the vectors before building a matrix soon enough). Meanwhile, we're building up a desiredRank x desiredRank tri-diagonal (i.e. very sparse) matrix using these basis vectors and their inner products.

        Now we ask COLT to get the eigenvectors and eigenvalues of the tridiagonal matrix; there will be desiredRank eigenvalues and desiredRank eigenvectors (each of dimension desiredRank).

        Here we get to where you're getting an NPE. We walk along the desiredRank^2 values in the eigenvector matrix ("eigenVects"), and for each of 0... desiredRank, we grab the basisVector (we have desiredRank of them, each of size numCols) and add a linear multiple of it onto something which will be the final eigenvector we'll return at the end of the day.
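The walkthrough above maps onto a toy dense Lanczos quite directly. Here is a numpy sketch of the same steps (my stand-in for corpus.timesSquared and the COLT eigensolve, with full re-orthogonalization for stability; not the Mahout code):

```python
import numpy as np

def lanczos(corpus, desired_rank):
    num_cols = corpus.shape[1]
    basis = []                                 # desiredRank basis vectors, each of size numCols
    alphas, betas = [], []
    v = np.ones(num_cols) / np.sqrt(num_cols)  # initial basis vector (number 0)
    basis.append(v)
    for i in range(desired_rank):
        w = corpus.T @ (corpus @ basis[i])     # corpus.timesSquared(basis[i]), size numCols
        alpha = float(w @ basis[i])
        alphas.append(alpha)
        w = w - alpha * basis[i]
        for b in basis:                        # orthogonalize against previous vectors
            w = w - (w @ b) * b
        beta = float(np.linalg.norm(w))
        betas.append(beta)
        if i + 1 < desired_rank:
            basis.append(w / beta)             # hang onto this vector
    # desiredRank x desiredRank tri-diagonal matrix from the inner products
    T = np.diag(alphas) + np.diag(betas[:-1], 1) + np.diag(betas[:-1], -1)
    eig_vals, eig_vects = np.linalg.eigh(T)    # the COLT step: desiredRank of each
    # combine basis vectors with the small eigenvectors to get the final ones;
    # this is the loop where a missing basis vector would blow up with an NPE
    eigen = [sum(eig_vects[j, m] * basis[j] for j in range(desired_rank))
             for m in range(desired_rank)]
    return eig_vals, eigen
```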

        What is SKMD doing?

        [code]
        LanczosState state = new LanczosState(L, overshoot, numDims, solver.getInitialVector(L));
        Path lanczosSeqFiles = new Path(outputCalc, "eigenvectors-" + (System.nanoTime() & 0xFF));
        solver.runJob(conf, state, overshoot, true, lanczosSeqFiles.toString());
        [/code]

        We're making a LanczosState specifying numCols = overshoot and desiredRank = numDims.

        Then we run the solver with desiredRank = overshoot.

        Looks like this is inconsistent; shouldn't the desiredRank be the same?

        Jeff Eastman added a comment - edited

        This result looks like the original result I got when it worked for a while. I'm treating the SKMD output as though it were clusters like the other Display routines. I think this is not correct but I don't understand what is wrong.

        Shannon Quinn added a comment -

        Similar results were actually what this issue was originally created to solve, before code rot created the other problems. The fact that I got actual clustering results when I was testing this code two summers ago would seem to imply that it's an API issue; DisplaySKM vs SKMDriver data format clashes would be my first guess.

        I applied your patch but I'm having trouble following where you fixed the Lanczos issue (though from within Eclipse I'm getting OutOfMemory errors...).

        Grant Ingersoll added a comment -

        Of course, the results don't really speak well of SKM, but at least there are some results!

        Grant Ingersoll added a comment -

        This gets past the Lanczos issue by checking the size. I HAVE NO IDEA IF THIS IS VALID MATHEMATICALLY, but it does show results!

        Grant Ingersoll added a comment -

        Seems the numDims == 1100 there is supposed to be the size of the affinity matrix, which is what we have generated from the sample data, so I guess that makes sense.

        Grant Ingersoll added a comment -

        I guess the 1100 comes from how we are calling all of this:

        SpectralKMeansDriver.run(new Configuration(), affinities, output, 1100, 2, measure, convergenceDelta, maxIter);
        Grant Ingersoll added a comment -

        In this particular case, the state has 4 basis vectors, but the "size" that j is being iterated over is 1100. Someone isn't going to be happy. I can see the easy fix (don't loop past that), but I don't know enough about Lanczos or SKMD to know whether what we are seeing is an artifact of SKMD or if this is a bug in Lanczos.

        Grant Ingersoll added a comment -

        The NPE is from one of the rowJ values being null (the 4th one). Line 156 in Lanczos:

         Vector rowJ = state.getBasisVector(j);

        This looks like an issue in Lanczos. Namely, we are assuming the size of the basis vectors from the state matches the size of the ejCol stuff. Of course, this might mean SKMD is doing something wrong. Perhaps Jake can weigh in here.

        Grant Ingersoll added a comment -

        patch so far, never mind the DisplayMinHash stuff, as I forgot to clean it up

        Shannon Quinn added a comment -

        I'm just now getting in on this (my environment completely died after a failed attempt to upgrade from Ubuntu 10.04 to 10.10...). Could the NullPointerException have anything to do with SKMD invoking the runJob() in the LanczosSolver that I alluded to in my previous comment, i.e. the one for which SKMD is the only caller?

        Grant Ingersoll added a comment -

        Making this change does indeed get us well past that problem and leads to:

        Exception in thread "main" java.lang.NullPointerException
        at org.apache.mahout.math.DenseVector.assign(DenseVector.java:133)
        at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:160)
        at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.runJob(DistributedLanczosSolver.java:72)
        at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.run(SpectralKMeansDriver.java:155)
        at org.apache.mahout.clustering.display.DisplaySpectralKMeans.main(DisplaySpectralKMeans.java:72)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)

        Not sure whether that is directly related to my change or not, but I'll continue to debug.

        Grant Ingersoll added a comment -

        Realizing now that Jeff already said that above. Digging deeper, however, it seems to me that the issue is that Hadoop is not expecting there to be a directory (tmp) in that directory. From the looks of it, we just want the part-m-**** file in there, but the file status is also returning the tmp dir that gets created when we do:

        DistributedRowMatrix L =
            VectorMatrixMultiplicationJob.runJob(affSeqFiles, D,
                new Path(outputCalc, "laplacian-" + (System.nanoTime() & 0xFF)));

        on line 142 of SpectralKMeansDriver. I wonder if we simply put that tmp directory elsewhere, or make sure that it is deleted when that job is done and all will be well?

        Perhaps a red herring, testing more.
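The failure mode described above (a tmp/ subdirectory being picked up alongside the part files) suggests filtering the output listing explicitly. A hedged local-filesystem sketch of that idea, using plain os rather than Hadoop's FileSystem/PathFilter API:

```python
import os

def part_files(output_dir):
    """List only the part-* files in a job output dir, ignoring tmp/ and friends."""
    return sorted(
        os.path.join(output_dir, name)
        for name in os.listdir(output_dir)
        if name.startswith("part-") and os.path.isfile(os.path.join(output_dir, name))
    )
```

On HDFS the equivalent would be a PathFilter passed to FileSystem.listStatus, which is roughly the shape a fix on the consumer side would take.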

        Grant Ingersoll added a comment -

        Tracing into the Hadoop code, this "data" dir gets appended via a MapFile. For some reason it thinks it has a MapFile here, so it points to something not getting configured correctly.

        Grant Ingersoll added a comment -

        Is there any way we could simplify TimesSquaredJob?

        Seems like there is an awful lot of deprecated Hadoop stuff in there.

        Dan Brickley added a comment -

        Shannon, "I'll investigate the manipulation of Configuration objects in SKMD" ... did you get a chance to do that?

        Shannon Quinn added a comment -

        If there are two DLS.runJob() methods and the spectral code is the only bit of code that calls one of the two runJob() methods, then in the interest of making the codebase just a tiny bit more maintainable I would vote for switching out the runJob() invoked by the spectral code and deleting the other one in DLS entirely.

        Regarding your tracing of the DRM.times() method, I was having the same problem: the fact that there exist so many chained job constructors makes it difficult to follow. Is there any way we could simplify TimesSquaredJob? Are each of those job creation methods called multiple times throughout the code base?

        Regarding this issue, it sounds like the problem either resides in TimesSquared not correctly setting the path as you mentioned (but this begs the question why no other algorithm which uses DRM.times() is running into the same problem), or the Configuration voodoo in SKMD is causing problems.

        I'll investigate the manipulation of Configuration objects in SKMD this week. If you have any thoughts on the other points, please let me know.

        Jeff Eastman added a comment -

        I'm running in the Eclipse debugger, debugging DisplaySpectralKMeans. This runs in local mode, and fails as reported above.

        Dan Brickley added a comment -

        re Sean's "I'd restart your cluster."; should it be fine to run the whole thing in MAHOUT_LOCAL=true mode, and bypass any complexity / issues from having a separate Hadoop cluster / pseudo-cluster?

        Jeff Eastman added a comment -

        All of this is buried inside of DistributedLanczosSolver. Either the problem resides in there and should impact all users of DLS, or it is in the SpectralKMeansDriver setup which invokes the DLS. It turns out the DLS.runJob(...) method employed (line 65) is only called by spectral clustering (KMeans and Eigencuts). The other overload, DLS.runJob(...) (line 80), is itself never called.

        Just looking at the invocation site (SpectralKMeansDriver.run(), line 155), I see two file paths being passed into DLS.runJob(...): the lanczosSeqFiles path is output/calculations/eigenvectors-17, the desired output path, and the LanczosState is constructed with L, a DRM with inputPath examples/output/calculations/laplacian-89. This is the input path which is failing in getFileStatus and causing the exception. Both of these look reasonable to me.

        There are, however, several different Configuration objects being manipulated by SKMD. I'm suspicious there is something horked in one of them which is causing the DLS file-not-found.

        Jeff Eastman added a comment -

        I've found where the /data is being added to the input path: it's in SequenceFileInputFormat.listStatus(JobConf). That is where MapFile.DATA_FILE_NAME is appended to get the dataFile path. This does not seem to be the source of the problem, however; instead I'm looking in DRM.times() where it calls TimesSquaredJob.createTimesJobConf(...). Looks to me like this method is setting the conf feature "DistributedMatrix.times.inputVector" to the correct file path (examples/output/calculations/laplacian-25/tmp/<ts>/DistributedMatrix.times.inputVector/<ts>), but is not setting the job's input paths, since FileInputFormat.getInputPaths(new JobConf(conf)) returns only "examples/output/calculations/laplacian-25".

        By the time the thread gets to listStatus() after kicking off DRM.times(), the JobConf input paths contain only "examples/output/calculations/laplacian-113/tmp" and /data is appended to that.

        The whole handling of Configurations and JobConfs is very twisted and difficult to follow.

        Sean Owen added a comment -

        That again looks like an environment issue; the reducer couldn't get data off the mapper. I don't know why in this case; you'd have to dig in to logs. I'd restart your cluster.

        Lance Norskog added a comment -

        Yes, it was MAVEN_OPTS; that one helps.

        With today's patch (Sep. 20, 2011 setting the job jars), I get (eventually) this error:

        11/09/20 23:10:35 INFO mapred.JobClient: Running job: job_201109191821_0016
        11/09/20 23:10:36 INFO mapred.JobClient:  map 0% reduce 0%
        11/09/20 23:11:01 INFO mapred.JobClient:  map 100% reduce 0%
        11/09/20 23:11:15 INFO mapred.JobClient: Task Id : attempt_201109191821_0016_r_000000_0, Status : FAILED
        Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
        11/09/20 23:11:15 WARN mapred.JobClient: Error reading task outputHost is down
        11/09/20 23:11:15 WARN mapred.JobClient: Error reading task outputHost is down
        11/09/20 23:12:00 INFO mapred.JobClient: Task Id : attempt_201109191821_0016_r_000000_1, Status : FAILED
        Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
        

        I stripped aff.txt down to a file with 20 nodes, and get the above error. This is on a single-node cluster on my laptop. Is it possible to run this job on such a small device? (If not, then DisplaySpectralKMeans as a Swing app might not be realistic.)

        Shannon Quinn added a comment -

        The full fix is MAHOUT-518 (in progress), where you no longer have to input affinity but instead raw data. I can certainly edit the affinity input for the time being, but once 518 is finished this point will be moot.

        Sean Owen added a comment -

        OK, is there an easy fix for your first point? Seems like a matter of input parsing.

        Shannon Quinn added a comment - - edited

        Sean: #4 is actually an off-by-one error resulting from specifying "dimensions 37" when the nodes are indexed in the input file as 1-37, while the program expects 0-36. Changing the input parameter to "dimensions 38" is kind of a fix, although it will result in the first row and first column of Mahout's internal representation of the affinity matrix being all 0s.

        Regarding the jobs, I have no idea how they ran previously; I never ran into that problem when first writing the jobs. Apparently there's a widely-employed use-case I simply didn't test?

        Beyond that, still can't find the source of the error in the attached EclipseLog; wherever that "/tmp" is being appended at the end, it isn't in any of the core Mahout code.
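        One hedged sketch of working around the off-by-one at parse time: shift the 1-indexed node IDs in each "i,j,affinity" line down to the 0-indexed form the driver expects. This is a hypothetical helper, not existing Mahout code.

```java
// Hypothetical helper: shift 1-indexed "i,j,affinity" node IDs down to
// the 0-indexed form SpectralKMeansDriver expects, so "dimensions 37"
// works without an all-zero first row/column.
public class AffinityReindex {
    static String reindexLine(String line) {
        String[] parts = line.split(",");
        int i = Integer.parseInt(parts[0].trim()) - 1; // 1-based -> 0-based
        int j = Integer.parseInt(parts[1].trim()) - 1;
        return i + "," + j + "," + parts[2].trim();
    }

    public static void main(String[] args) {
        System.out.println(reindexLine("1,37,0.5")); // -> 0,36,0.5
    }
}
```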

        Hudson added a comment -

        Integrated in Mahout-Quality #1051 (See https://builds.apache.org/job/Mahout-Quality/1051/)
        MAHOUT-524 added danbri's setJarByClass() patch and logging

        srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1172995
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/spectral/common/AffinityMatrixInputJob.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/spectral/common/AffinityMatrixInputMapper.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/spectral/common/MatrixDiagonalizeJob.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/spectral/common/UnitVectorizerJob.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/spectral/common/VectorCache.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/spectral/common/VectorMatrixMultiplicationJob.java
        Sean Owen added a comment -

        Lance, in your command line you use "MAVENOPTS" and not "MAVEN_OPTS". Is that the issue?

        I think I agree with Dan's patch, but wonder how these jobs ever worked otherwise? But yes everything needs to call setJar() or setJarByClass(). AbstractJob takes care of this for almost all the M/Rs in the project; these are not using it.

        I think you're welcome to propose patches for your improvements #2 and #3. I don't know the answer for #4: if it's OK for there to be nothing in the vector cache at this point, the code shouldn't assume there is. And if the cache should have something I don't know why there isn't.

        Dan Brickley added a comment -

        re job jar error, see MAHOUT-428 MAHOUT-197.

        draft patch: https://raw.github.com/gist/1200439/4ad433b51e9d963cff5d500d974fa5cb6904b9c3/gistfile1.txt

        I posted a patch that got me past those errors in the recent mailing list thread 'Spectral clustering - a bundle of issues'. I'll paste the relevant chunk of my email below. See http://comments.gmane.org/gmane.comp.apache.mahout.user/9319


        Trying to run https://cwiki.apache.org/MAHOUT/spectral-clustering.html
        ... seems perhaps some code rot?

        Can anyone else report success with Spectral clustering against recent trunk?

        Trying bin/mahout spectralkmeans -k 2 -i speccy -o specout --maxIter
        10 --dimensions 37

        ...with the small example affinity file we discussed yesterday, I hit
        a series of problems.

        data: http://danbri.org/2011/mahout/afftest.txt

        1. As I mentioned in comments in
        http://spectrallyclustered.wordpress.com/2010/07/14/sprint-3-quick-update/
        (both for local pseudo-cluster, and a real one) I had to patch in
        calls to job.setJarByClass before job.waitForCompletion. This problem
        occurred for others elsewhere in Mahout, e.g. MAHOUT-428 and
        MAHOUT-197, but I presume it can't be hitting everyone. From grepping
        around, this might not be the only component missing setJarByClass
        calls. Or is this just me, somehow?

        2. Newlines in the input data made it fail, but the associated warning
        from AffinityMatrixInputMapper was very vague. I'd suggest allowing
        those and #-comments, but maybe not a good idea to make per-component
        syntax designs? Suggest also it's worth printing the problem line (see
        patch below) when complaining.
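        Point 2 could be addressed with a parser along these lines: tolerate blank lines and #-comments, and quote the offending line on failure. This is a sketch only; the real code lives in AffinityMatrixInputMapper, and the names here are illustrative.

```java
// Sketch of a more forgiving affinity-line parser: blank lines and
// #-comments are skipped, and a bad line is quoted in the error
// message instead of a vague warning.
public class AffinityLineParser {
    /** Returns {i, j, value}, or null for a blank/comment line. */
    static double[] parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return null; // tolerated, not an error
        }
        String[] parts = trimmed.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                "Expected 'i,j,value' but got: \"" + line + "\"");
        }
        return new double[] {
            Double.parseDouble(parts[0].trim()),
            Double.parseDouble(parts[1].trim()),
            Double.parseDouble(parts[2].trim())
        };
    }

    public static void main(String[] args) {
        double[] t = parse("0, 1, 0.5");
        System.out.println(t[0] + "," + t[1] + "," + t[2]); // 0.0,1.0,0.5
    }
}
```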

        3. Failing to load the affinity matrix (surely a requirement for
        further progress?) does not seem to halt the job, I see exceptions
        mixed in with ongoing processing (until a later problem hits us).
        Transcript: https://gist.github.com/1200455 ... actually it wasn't
        clear if the newline problem was more of a warning, and other rows
        from the input data were accepted. In which case, reporting them as
        java.io.IOException seems a bit draconian. So maybe bits of the input
        file were in fact loaded. It would be great to clarify what expected
        behaviour is.

        4. After all that, the job still fails. Full transcript here:
        https://gist.github.com/1200428

        Excerpt: (I've added a bit more reporting output in a few places)

        11/09/07 14:25:06 INFO common.VectorCache: Loading vector from:
        specout/calculations/diagonal/part-r-00000
        Exception in thread "main" java.util.NoSuchElementException
        at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
        at org.apache.mahout.clustering.spectral.common.VectorCache.load(VectorCache.java:121)

        However that file does exist in hdfs, and seqdumper seems to accept
        it; it just seems empty:

        Input Path: specout/calculations/diagonal/part-r-00000
        Key class: class org.apache.hadoop.io.NullWritable Value Class: class
        org.apache.mahout.math.VectorWritable
        Count: 0

        I've posted an informal composite patch at
        https://raw.github.com/gist/1200439/4ad433b51e9d963cff5d500d974fa5cb6904b9c3/gistfile1.txt
        ... if you can confirm the above issues and a breakdown into JIRAs,
        I'll attach cleaner patches where appropriate.
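        For point 4, the immediate NoSuchElementException could at least be turned into a clearer failure by guarding the iterator in VectorCache.load-style code. Below is a plain-Java stand-in for the sequence-file iterator, assuming nothing about the real Mahout fix.

```java
import java.util.Iterator;

// Defensive sketch: check hasNext() before next(), so an empty
// part-r-00000 produces a clear error naming the file rather than a
// bare NoSuchElementException deep in AbstractIterator.
public class SafeFirst {
    static <T> T first(Iterator<T> it, String source) {
        if (!it.hasNext()) {
            throw new IllegalStateException("No vector found in " + source);
        }
        return it.next();
    }

    public static void main(String[] args) {
        Integer v = first(java.util.Arrays.asList(1, 2).iterator(),
                          "part-r-00000");
        System.out.println(v); // 1
    }
}
```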

        Lance Norskog added a comment - - edited

        1) I hiked the memory up to 2g, same result. Is it possible the option did not transmit to the JVM that runs the job?
        2) I did not have this problem under Eclipse.

        In a separate investigation, running the spectralkmeans mahout job gives the attached command-line failure log attached as SpectralKMeans_fail_20110919.txt. Yes, this is the 'get jars out to the hadoop executor' problem. The 'job' jar does not seem to do what it needs. Again, note that one failure does not cause the whole job to exit. I submit that there are multiple problems inside the job, and somehow there is a problem where the main job configurations do not get transmitted to a subsidiary executor.

        Sean Owen added a comment -

        This is just an OutOfMemoryError. You have to tell Maven to use more memory for its JVM or else most M/R jobs will fail like this locally. Use MAVEN_OPTS=-Xmx1g . I'm afraid this isn't the issue.

        Lance Norskog added a comment -

        As for 5-d points vs. 2-d points, SVD does a great job, followed by random projection.
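        A minimal sketch of the random-projection half of that suggestion, reducing n-d cluster points to 2-d for a display. Illustrative only; Mahout has its own SVD and projection machinery, and nothing here is its API.

```java
import java.util.Random;

// Illustrative Gaussian random projection: maps n-d points to 2-d so
// higher-dimensional spectral output could be fed to a 2-d display.
public class RandomProjection2D {
    static double[][] project(double[][] points, long seed) {
        int inDim = points[0].length;
        Random rnd = new Random(seed);
        // Random Gaussian matrix, scaled to roughly preserve distances.
        double[][] r = new double[inDim][2];
        for (int i = 0; i < inDim; i++) {
            for (int j = 0; j < 2; j++) {
                r[i][j] = rnd.nextGaussian() / Math.sqrt(2.0);
            }
        }
        double[][] out = new double[points.length][2];
        for (int p = 0; p < points.length; p++) {
            for (int j = 0; j < 2; j++) {
                for (int i = 0; i < inDim; i++) {
                    out[p][j] += points[p][i] * r[i][j];
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] pts = {{1, 2, 3, 4, 5}, {5, 4, 3, 2, 1}};
        double[][] xy = project(pts, 42L);
        System.out.println(xy.length + " points in " + xy[0].length + "-d");
    }
}
```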

        Lance Norskog added a comment -

        For completeness, the log when running under Eclipse is attached as EclipseLog_20110918.txt

        Lance Norskog added a comment -

        Possibly a little help. When run from the command line via mvn exec, this is the error log. Note that
        a) an exception happens in an early m/r pass, and
        b) the exception is ignored by the full job executor.
        (MacOS X "Kitty Liver")

        lance$ MAVENOPTS=Xmx1000m mvn -q exec:java -Dexec.mainClass="org.apache.mahout.clustering.display.DisplaySpectralKMeans"

        SLF4J: Class path contains multiple SLF4J bindings.
        SLF4J: Found binding in [jar:file:/Users/lancenorskog/.m2/repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
        SLF4J: Found binding in [jar:file:/Users/lancenorskog/.m2/repository/org/slf4j/slf4j-jcl/1.6.1/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
        SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
        11/09/18 22:25:26 INFO common.HadoopUtil: Deleting samples
        11/09/18 22:25:26 INFO common.HadoopUtil: Deleting output
        11/09/18 22:25:26 INFO display.DisplayClustering: Generating 500 samples m=[1.0, 1.0] sd=3.0
        11/09/18 22:25:26 INFO display.DisplayClustering: Generating 300 samples m=[1.0, 0.0] sd=0.5
        11/09/18 22:25:26 INFO display.DisplayClustering: Generating 300 samples m=[0.0, 2.0] sd=0.1
        11/09/18 22:25:28 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
        11/09/18 22:25:28 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
        11/09/18 22:25:28 INFO input.FileInputFormat: Total input paths to process : 1
        11/09/18 22:25:28 INFO mapred.JobClient: Running job: job_local_0001
        11/09/18 22:25:28 INFO mapred.MapTask: io.sort.mb = 100
        *11/09/18 22:25:29 WARN mapred.LocalJobRunner: job_local_0001
        java.lang.OutOfMemoryError: Java heap space
        	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:949)
        	at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:674)
        	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
        	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)*
        11/09/18 22:25:29 INFO mapred.JobClient:  map 0% reduce 0%
        11/09/18 22:25:29 INFO mapred.JobClient: Job complete: job_local_0001
        
        Dan Brickley added a comment -

        I had a look around but failed to find a string in the Mahout Java source corresponding to that path; I presume it's coming from an included module or config file.

        Hadoop btw has

        ./io/MapFile.java: public static final String DATA_FILE_NAME = "data";

        though I don't see any direct use of MapFile or DATA_FILE_NAME, I'm only grepping around textually; Eclipse might have smarter tooling.

        http://lucene.472066.n3.nabble.com/Overhauled-org-apache-mahout-cf-taste-hadoop-item-td745286.html suggests mapfile isn't so much used any more, so this might be a false lead.

        Shannon Quinn added a comment -

        Just for grins, I tried:

        FileInputFormat.getInputPaths(conf).length

        right before the TimesSquaredJob started, and it was 1, not 2. Ever more confused.

        Shannon Quinn added a comment - - edited

        I've been tooling around with this code for a few hours now and cannot figure out where the pesky "/data" is being appended to the overall path...or why the second Path that Lance mentioned isn't what is actually being used. It has to be somewhere in the Lanczos solver code (filtering into the DistributedRowMatrix and its TimesSquaredJob, as the latter is what is actually causing the exception), but in all my searching and println()-ing of paths I can't seem to find it.

        Just prior to the TimesSquaredJob kicking off, the Lanczos solver outputs that it is "Finding 4 singular vectors", followed by this output:

        11/09/11 19:28:42 INFO mapred.FileInputFormat: Total input paths to process : 2

        which is very confusing to me, since in the TimesSquaredJob "createTimesSquaredJobConf()" method, there is only one invocation of FileInputFormat.addInputPath(). This mysterious second input path may very well be the cause of the problems, but again I just can't seem to find where it's added.

        I'm going to keep looking, but any help in finding this bug would be greatly appreciated.

        Dan Brickley added a comment -

        Not sure if you're mixing me and Danny Bickson, but I've certainly seen these errors mentioning tmp/data paths, ... but the problem was when attempting spectral clustering; I didn't get as far as having any results to display.

        Shannon Quinn added a comment - edited

        I believe this is the exact problem Dan Brickley picked up on his thread to the users list; I'm working on this. The problem is somewhere in the SpectralKMeansDriver in how I set up the Paths that are used. Will update this week.

        Lance Norskog added a comment - edited

        Running DisplaySpectralKMeans gives this error:

        FileNotFoundException:
        examples/output/calculations/laplacian-48/tmp/data not found

        In fact, the data is stored here:

        examples/output/calculations/laplacian-48/tmp/1314835934416372000/DistributedMatrix.times.inputVector/

        Also, the directory with "diagonal" does not have a number; it assumes it is underneath a job-unique path:

        examples/output/calculations:
        
        diagonal	laplacian-48	seqfile-160
        

        (The jobs create unique directories with a random 8-bit number.)

        Any hints on exactly which API call is wrong?

        Lance Norskog added a comment -

        +1

        I'm documenting the Display outputs and it would be nice to have all of them.

        Jeff Eastman added a comment -

        The original example was extracting 5 eigenvectors and thus returned 5-d results. I changed it to extract 2 vectors and it used to run but displayed incorrect results.

        I'm (still since pre 0.5 testing, IIRC) getting a FileNotFoundException in the bowels of DRM.times while running this in local Hadoop mode. I wonder if it is possible to add a --method sequential implementation for SpectralKMeans to help separate the algorithmic issues from the file bookkeeping ones?

        We have a sequential Lanczos implementation...

        Exception in thread "main" java.lang.IllegalStateException: java.io.FileNotFoundException: File file:/home/dev/workspace/mahout/examples/output/calculations/laplacian-33/tmp/data does not exist.
        at org.apache.mahout.math.hadoop.DistributedRowMatrix.times(DistributedRowMatrix.java:222)
        at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
        at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.runJob(DistributedLanczosSolver.java:72)
        at org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.run(SpectralKMeansDriver.java:155)
        at org.apache.mahout.clustering.display.DisplaySpectralKMeans.main(DisplaySpectralKMeans.java:72)
        Caused by: java.io.FileNotFoundException: File file:/home/dev/workspace/mahout/examples/output/calculations/laplacian-33/tmp/data does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:371)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:51)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:211)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:929)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:921)
        at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:838)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:791)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:791)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:765)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1200)
        at org.apache.mahout.math.hadoop.DistributedRowMatrix.times(DistributedRowMatrix.java:214)
        ... 4 more
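
        Jeff's idea of a sequential path to separate the algorithmic issues from the Hadoop file bookkeeping can be prototyped in a few lines of plain Java. The sketch below uses simple power iteration on a small dense symmetric matrix; it is an illustrative stand-in only, not Mahout's LanczosSolver (sequential or distributed), and the class and method names are made up for this example.

        ```java
        import java.util.Arrays;

        public class PowerIterationSketch {

          /**
           * Estimates the dominant eigenvalue of a symmetric positive
           * semi-definite matrix by repeated multiplication and normalization.
           * For a unit vector v converging to the dominant eigenvector,
           * ||A v|| converges to the dominant eigenvalue.
           */
          static double dominantEigenvalue(double[][] a, int iterations) {
            int n = a.length;
            double[] v = new double[n];
            Arrays.fill(v, 1.0 / Math.sqrt(n)); // normalized starting vector
            double lambda = 0.0;
            for (int it = 0; it < iterations; it++) {
              // av = A * v
              double[] av = new double[n];
              for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                  av[i] += a[i][j] * v[j];
                }
              }
              // normalize; the norm is the current eigenvalue estimate
              double norm = 0.0;
              for (double x : av) {
                norm += x * x;
              }
              norm = Math.sqrt(norm);
              for (int i = 0; i < n; i++) {
                v[i] = av[i] / norm;
              }
              lambda = norm;
            }
            return lambda;
          }

          public static void main(String[] args) {
            // Symmetric matrix [[2,1],[1,2]] has eigenvalues 3 and 1.
            double lambda = dominantEigenvalue(new double[][] {{2, 1}, {1, 2}}, 100);
            System.out.println("dominant eigenvalue estimate: " + lambda);
          }
        }
        ```

        A sequential check like this runs entirely in memory, so a FileNotFoundException in DRM.times could not mask whether the decomposition itself is behaving.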

        Shannon Quinn added a comment -

        +1, I'm on it.

        I'm a little unclear as to the context of the initial Hudson comment: the display method is expecting 2D vectors, but getting 5D ones?

        Sean Owen added a comment -

        It seems like there's an issue here that should be fixed with some priority.

        Sean Owen added a comment -

        There does seem to be a bug here, so I don't think this should be closed. But if anything it's a problem with the examples code. And I don't know whether there's reason to expect more investigation in the few weeks before 0.5? If not, probably best to push it down the road. Obviously it would be great to get any progress in before the upcoming release.

        Shannon Quinn added a comment -

        In case anyone was interested, I wrote a quick script that visualizes the labelings. Not really much insight to be had other than to confirm that it doesn't work quite right yet.

        Shannon Quinn added a comment -

        No, there's definitely something wrong here. I've attached some synthetic data I generated - concentric circles of 2D points, which spectral clustering is particularly good at correctly grouping. The "raw" file contains 450 raw data points in 3 separate circles (feel free to plot them to take a look). The affinities are generated by providing a cutoff in terms of Euclidean distance - say, 2.0 - where any pair of points with a distance of < 2 is given an affinity (here using the Gaussian kernel, which gives a nice [0, 1] affinity), and everything else is set to 0 (enforcing sparsity in the affinity matrix). Plus, I constructed the data specifically such that the points between circles have a minimum distance of 2.

        Unfortunately, if you run SpectralKMeans on the aff.txt file, other than the tightly-packed cluster in the middle it doesn't do a particularly good job of identifying the other two clusters (points 0-149 should have the same ID, as well as points 150-299, and 300-449). Obviously there is still something amiss; a good place to start is to take a look at the eigenvectors generated by the LanczosSolver. If everything is behaving as it should, these should show piecewise constancy: that is, each 150 consecutive elements in the 450-element vectors should have about the same value. If this is not the case, we're either generating affinities incorrectly, or there's a problem with the algorithm itself.
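
        The affinity construction described above - concentric circles, a Euclidean distance cutoff, a Gaussian kernel for pairs inside the cutoff, zero outside - can be sketched with stdlib Java as follows. This is a hypothetical reconstruction for illustration, not the code that produced raw.txt or aff.txt; the class name, circle radii, and sigma value are all assumptions.

        ```java
        import java.util.ArrayList;
        import java.util.List;

        public class AffinitySketch {

          /** n points evenly spaced on a circle of the given radius, centered at the origin. */
          static List<double[]> circle(double radius, int n) {
            List<double[]> pts = new ArrayList<>();
            for (int i = 0; i < n; i++) {
              double theta = 2 * Math.PI * i / n;
              pts.add(new double[] {radius * Math.cos(theta), radius * Math.sin(theta)});
            }
            return pts;
          }

          /**
           * Gaussian-kernel affinity with a hard distance cutoff: pairs closer than
           * the cutoff get exp(-d^2 / (2 sigma^2)), a value in (0, 1]; everything
           * else is 0, which keeps the affinity matrix sparse.
           */
          static double affinity(double[] a, double[] b, double cutoff, double sigma) {
            double dx = a[0] - b[0];
            double dy = a[1] - b[1];
            double d = Math.sqrt(dx * dx + dy * dy);
            return d < cutoff ? Math.exp(-d * d / (2 * sigma * sigma)) : 0.0;
          }

          public static void main(String[] args) {
            // Radii chosen (assumption) so circles are always >= 2.0 apart,
            // mirroring the minimum inter-circle distance described above.
            List<double[]> pts = new ArrayList<>();
            pts.addAll(circle(1.0, 150)); // points 0-149
            pts.addAll(circle(4.0, 150)); // points 150-299
            pts.addAll(circle(7.0, 150)); // points 300-449
            double cutoff = 2.0;
            double sigma = 1.0;
            int nonZero = 0;
            for (int i = 0; i < pts.size(); i++) {
              for (int j = i + 1; j < pts.size(); j++) {
                if (affinity(pts.get(i), pts.get(j), cutoff, sigma) > 0) {
                  nonZero++;
                }
              }
            }
            // With these radii no cross-circle pair falls inside the cutoff,
            // so every nonzero affinity connects points on the same circle.
            System.out.println("nonzero off-diagonal pairs: " + nonZero);
          }
        }
        ```

        Under this construction the block structure of the affinity matrix matches the three circles exactly, which is what should make the top eigenvectors piecewise constant over each run of 150 points.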

        Also, I noticed when attempting to run the program that it often doesn't show the entire list of available and required arguments. I couldn't reliably determine a cause; often it would show just 1 of the required arguments, but if I supplied some of the required arguments and left out others, it would display all of them. I'm assuming this is a bug; any idea where I could find it?

        Hudson added a comment -

        Integrated in Mahout-Quality #567 (See https://hudson.apache.org/hudson/job/Mahout-Quality/567/)

        Jeff Eastman added a comment -

        The Display algorithm now runs without errors but the 2 clusters it produces are clearly not what I was expecting. Probably a gross misunderstanding on my part and a final output processing step that needs to be invented.

        Sean Owen added a comment -

        Jeff, sounds like there is no outstanding issue here at the moment, or is there something more to track here?

        Jeff Eastman added a comment -

        This does not impact 0.4 usability as spectral clustering is still experimental and needs I/O. The display routine could be removed for hygiene but I prefer to leave it in with a caveat that it is part of several work-in-progress issues to integrate spectral clustering into the rest of the clustering portfolio.

        Hudson added a comment -

        Integrated in Mahout-Quality #392 (See https://hudson.apache.org/hudson/job/Mahout-Quality/392/)
        MAHOUT-524: Moved numEigensWritten initialization out of loop. SpectralKMeans now runs to completion but display routine is expecting a 2-d vector and is getting a 5-d vector. Not clustering the original input points. More to test but CleanEigensJob is working.


          People

          • Assignee:
            Shannon Quinn
            Reporter:
            Jeff Eastman
          • Votes:
            1
            Watchers:
            3
