MAHOUT-966

Mismatch in the number of points given by the clusterDumper and ClusterOutputPostProcessor

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Not a Problem
    • Affects Version/s: 0.6
    • Fix Version/s: 0.8
    • Component/s: Integration
    • Labels:
      None
    • Environment:

      hadoop 0.20.2 mahout 0.6

      Description

      After running the post processor, the number of points each cluster contains does not match the number of points that clusterdumper reports for that cluster.

      MSV-287

      { n=90 c=[0.05195, 0.05675, 0.07151, 0.05713, 0.06946,...}

      MSV-145

      { n=90 c=[0.93685, 0.93071, 0.93641, 0.94629, 0.94409,..}

      The n reported in clusters-n-final for each cluster differs from the number of points actually contained in the directory for that cluster. Any idea why this is happening?

      1. points100dCCNorm.txt
        1.77 MB
        Gaurav Redkar
      2. mtestdata.txt
        1.76 MB
        Gaurav Redkar
      3. clusterpp-output.txt
        14 kB
        Tharindu Mathew
      4. cluster-dumper-output.txt
        1.59 MB
        Tharindu Mathew

        Activity

        Gaurav Redkar added a comment -

        The commands used were as follows.

        For clustering:
        bin/mahout org.apache.mahout.clustering.syntheticcontrol.meanshift.Job -x 15 -cd 0.05 -t1 0.7 -t2 0.5 -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -i testdata -ow -o output

        For post-processing the output:
        bin/mahout clusterpp -i output -o output/ppclusters

        For clusterdumper:
        bin/mahout clusterdump -s output/clusters-3-final -p output/clusteredPoints -o /usr/local/trunk/examples/clusteranalyze300112.txt

        Gaurav Redkar added a comment -

        The dataset has 1000 points with 200 attributes each

        Paritosh Ranjan added a comment -

        Ok, I will have a look at it.

        Paritosh Ranjan added a comment -

        My first impression is that you did not use the -cl option while clustering, which means the vectors have not been clustered.

        --clustering (-cl) If present, run clustering after
        the iterations have taken place

        https://cwiki.apache.org/MAHOUT/mean-shift-commandline.html

        Clustering in Mahout is a two-step process: the first step finds the centroids of the clusters, and the second finds the vectors associated with each centroid.

        To trigger the second step (finding the vectors for each centroid), use the -cl command line option (the runClustering parameter in Java).

        The ClusterOutputPostProcessor works on clustered data, i.e. on the vectors that were clustered in the second step.

        Please try with -cl option while clustering. I think it would solve the problem. If not, I will investigate more.
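
        The two steps described above can be illustrated with a small, self-contained sketch. It is not Mahout code: the centroids and points are made-up toy values, the function names are hypothetical, and nearest-centroid assignment by Euclidean distance is assumed only for illustration. It shows why per-cluster member counts only exist once the second (assignment) step has run.

```python
import math

def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def assign_points(points, centroids):
    """Step two: map every input vector to its nearest centroid."""
    members = {i: [] for i in range(len(centroids))}
    for p in points:
        members[nearest(p, centroids)].append(p)
    return members

# Step one (finding the centroids) is assumed to have run already.
# These centroids and points are toy values, not Mahout output.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, 0.2), (9.0, 11.0), (1.0, -1.0), (10.5, 9.5), (0.1, 0.1)]

members = assign_points(points, centroids)
counts = {i: len(v) for i, v in members.items()}
print(counts)
```

        Without the assignment step there is no member list to count, which is the situation a post processor would face if clustering were run without -cl.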

        Gaurav Redkar added a comment -

        Hello,

        As Paritosh suggested, I tried specifying the -cl option while clustering, but I am still experiencing the same problem. The number of members printed by clusterdumper matches the number of points generated by the ClusterOutputPostProcessor for each cluster. However, this number does not match the value 'n' that clusterdumper prints for that cluster.

        Also, when I ran the algorithm on a different dataset, the clustering produced two clusters with the same cluster identifier, and that cluster contained some of the points twice. Any idea why this is happening?

        The command used for performing the clustering job is :

        bin/mahout org.apache.mahout.clustering.syntheticcontrol.meanshift.Job -x 15 -cd 5 -t1 100 -t2 30 -cl -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -i testdata -ow -o output

        I am attaching the dataset on which I tried the clustering. Kindly give your suggestions.

        Paritosh Ranjan added a comment -

        It looks like there is a bug in clusterdumper; other users have also run into it. See:
        http://comments.gmane.org/gmane.comp.apache.mahout.user/10906

        To summarize, there is a mismatch between the value of n (the number of vectors in a particular cluster) displayed by clusterdumper and the number of vectors actually present in it. Prima facie, it looks like the value of n is printed incorrectly.

        Tharindu Mathew added a comment -

        Initial analysis shows that the n printed by Cluster Dumper deviates significantly.

        For a clustering run with the params in this mail, clusterpp and cluster dumper both show 200 points in each cluster, while n from cluster dumper shows values of 223, 392, 290, 207 and 235 (refer to cluster-dumper-output.txt and clusterpp-output.txt; I hacked the clusterpp code to provide this output).

        The correct fix is to make n display 200.

        Gaurav Redkar added a comment -

        I modified the clusterdumper and meanshift clustering source code so that clusterdumper outputs the number of boundPoints (the size of the "boundPoints" list) along with the numPoints, radius and center for each cluster.

        When I ran the clustering job on synthetic_control.data using the following parameters:

        bin/mahout org.apache.mahout.clustering.syntheticcontrol.meanshift.Job -x 25 -cd 5 -t1 50 -t2 10 -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -i testdata -ow -o output -cl

        some of the clusters had different values for the variable "numPoints" and the size of "boundPoints".

        What I want to know is: what is the difference between "numPoints" and "boundPoints", and shouldn't the size of the "boundPoints" list be the same as "numPoints"?

        Also, with reference to this thread, the number of members printed for each cluster matched the number of boundPoints for that cluster.

        Any suggestions?

        Paritosh Ranjan added a comment -

        Gaurav,
        Would you like to take this further?

        Gaurav Redkar added a comment -

        Yes, I can try to look into this issue. I would like a clarification of the difference between the variables "numPoints" and "boundPoints", as mentioned in my previous comment.

        The point to note is that the size of "boundPoints" (a list of the points belonging to a cluster), which I printed by tweaking the clusterdumper code, actually matched the number of points printed in each cluster. So could it be that "numPoints" was not properly calculated at the end of the last iteration, before the algorithm terminates? It is just a guess. I will try to look deeper into it.

        Grant Ingersoll added a comment -

        Any update on this? Seems like it should be fixed for 0.8

        Hudson added a comment -

        Integrated in Mahout-Quality #2031 (See https://builds.apache.org/job/Mahout-Quality/2031/)
        MAHOUT-966: add option to supply your own Reuters dataset (Revision 1488749)

        Result = FAILURE
        gsingers :
        Files :

        • /mahout/trunk/examples/bin/cluster-reuters.sh
        Grant Ingersoll added a comment -

        This is actually behaving correctly. Here's what I did:

        1. bin/mahout org.apache.mahout.clustering.syntheticcontrol.meanshift.Job -x 25 -cd 5 -t1 50 -t2 10 -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -i /path/to/synthetic_control.data -ow -o output -cl
        2. Independently, do:
          1. bin/mahout clusterdump -i output/clusters-5-final/ -p output/clusteredPoints -o /tmp/clusterdump.txt
          2. For clusterPP
            1. bin/mahout clusterpp -i output -o output/post
            2. bin/mahout seqdumper -i output/post/0/part-r-00000 --facets

        Both report 5 clusters total.
        For clusterpp, Seq Dumper reports the following number of points per cluster:

        ----Facets--
        Key Count
        0 145
        101 31
        104 25
        200 199
        300 200

        For clusterdumper, I see:

        MSV-0{n=145
        MSV-101{n=31
        MSV-104{n=25
        MSV-200{n=199
        MSV-300{n=200
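
        The comparison above can also be done mechanically. The following is a minimal sketch, assuming the two outputs look like the fragments quoted in this comment; the regular expression, the function names and the exact line formats beyond those fragments are assumptions, not part of Mahout.

```python
import re

# Fragments copied from the comment above; real output contains more text per line.
CLUSTERDUMP = """\
MSV-0{n=145
MSV-101{n=31
MSV-104{n=25
MSV-200{n=199
MSV-300{n=200
"""

FACETS = """\
Key Count
0 145
101 31
104 25
200 199
300 200
"""

def parse_clusterdump(text):
    """Extract {cluster id: n} from lines such as 'MSV-101{n=31'."""
    pattern = re.compile(r"MSV-(\d+)\s*\{\s*n=(\d+)")
    return {int(m.group(1)): int(m.group(2)) for m in pattern.finditer(text)}

def parse_facets(text):
    """Extract {key: count} from 'key count' rows, skipping the header row."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit() and parts[1].isdigit():
            counts[int(parts[0])] = int(parts[1])
    return counts

def mismatches(dump_counts, facet_counts):
    """Return {cluster id: (n from clusterdump, count from seqdumper)} where they differ."""
    ids = set(dump_counts) | set(facet_counts)
    return {i: (dump_counts.get(i), facet_counts.get(i))
            for i in sorted(ids) if dump_counts.get(i) != facet_counts.get(i)}

dump_counts = parse_clusterdump(CLUSTERDUMP)
facet_counts = parse_facets(FACETS)
diff = mismatches(dump_counts, facet_counts)
print(diff or "all cluster sizes agree")
```

        For the figures quoted here the two sources agree on every cluster, and the counts sum to 600 points.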

        Hudson added a comment -

        Integrated in Mahout-Quality #2032 (See https://builds.apache.org/job/Mahout-Quality/2032/)
        MAHOUT-966: adding some more options to cluster-syntheticcontrol (Revision 1488758)

        Result = SUCCESS
        gsingers :
        Files :

        • /mahout/trunk/examples/bin/cluster-syntheticcontrol.sh

          People

          • Assignee:
            Grant Ingersoll
            Reporter:
            Gaurav Redkar
          • Votes:
            0
            Watchers:
            5
