MAHOUT-766: fuzzy kmeans - all clusters with the same top terms

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not a Problem
    • Affects Version/s: 0.6
    • Fix Version/s: 0.6
    • Component/s: Clustering, Examples
    • Labels:
      None
    • Environment:

      Tested on OS X and Linux

      Description

      I believe there is something wrong with fkmeans in trunk.

      I am using code from trunk (last checked out 6/30/11). Reproducing it is simple:
      1) change examples/bin/build-reuters.sh to use fkmeans and set -m 2
      2) run build-reuters.sh
      3) dump the clusters. I'm running: ../../bin/mahout clusterdump -dt sequencefile -s ./mahout-work/reuters-kmeans/clusters-6 -b 100 -o ./reuters-clusterdump.txt -d ./mahout-work/reuters-out-seqdir-sparse-kmeans/dictionary.file-0

      Here is what the clusters look like:
      SV-15898{n=34 c=[0:0.020, 0.003:0.001, 0.006913:0.001, 0.01:0.004, 0.02:0.002, 0.03:0.001, 0.046:0.0
      Top Terms:
      said => 1.7254762602900604
      mln => 1.2510936664951733
      dlrs => 1.1340145215097008
      3 => 1.0643797240793276
      pct => 1.0422760712239152
      reuter => 1.0202689935247569
      its => 0.9997771992646881
      from => 0.9903731234557381
      year => 0.8855389859684145
      vs => 0.8291746545786391
      :SV-14766{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004, 0.02:0.002, 0.03:0.001, 0.046:0.0
      Top Terms:
      said => 1.6406710289350412
      mln => 1.2174993414858022
      dlrs => 1.0937941570322955
      3 => 1.0334420773050856
      pct => 0.991539915235039
      reuter => 0.990042452019326
      its => 0.9508638527143669
      from => 0.9403885495991262
      vs => 0.865437130369746
      year => 0.8463503194752994
      :SV-14854{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004, 0.02:0.002, 0.03:0.001, 0.046:0.0
      Top Terms:
      said => 1.641260962665307
      mln => 1.217806578134094
      dlrs => 1.0941157210136143
      3 => 1.0336934328877394
      pct => 0.991895013999163
      reuter => 0.9902889592990656
      its => 0.9512076670014483
      from => 0.9407384847445094
      vs => 0.8653426311034671
      year => 0.8466407590692175
      :SV-14890{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004, 0.02:0.002, 0.03:0.001, 0.046:0.0
      Top Terms:
      said => 1.6410352907185948
      mln => 1.21769021136256
      dlrs => 1.0939933408434481
      3 => 1.0335977297579235
      pct => 0.991759193577722
      reuter => 0.9901951250301172
      its => 0.9510761761632947
      from => 0.9406047832581563
      vs => 0.8653814488835572
      year => 0.8465301083353372
      :SV-14972{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004, 0.02:0.002, 0.03:0.001, 0.046:0.0
      Top Terms:
      said => 1.640981249652196
      mln => 1.2176595452829564
      dlrs => 1.093962519439548
      3 => 1.0335737897463568
      pct => 0.9917266257955816
      reuter => 0.9901715950801396
      its => 0.9510446208123859
      from => 0.9405723357372776
      vs => 0.8653843699725567
      year => 0.846502466267153
      :SV-15023{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004, 0.02:0.002, 0.03:0.001, 0.046:0.0
      Top Terms:
      said => 1.6399319888551425
      mln => 1.217099157115808
      dlrs => 1.0933830369192543
      3 => 1.033121271434882
      pct => 0.991094828319561
      reuter => 0.9897275313905611
      its => 0.9504327303592046
      from => 0.9399480272494183
      vs => 0.8655203514280634
      year => 0.8459804922897428
      :SV-15330{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004, 0.02:0.002, 0.03:0.001, 0.046:0.0
      Top Terms:
      said => 1.6411480082558068
      mln => 1.217746071140758
      dlrs => 1.0940532425506244
      3 => 1.0336447143638317
      pct => 0.9918269975797083
      reuter => 0.990241145450359
      its => 0.9511417993006985
      from => 0.9406712099799636
      vs => 0.8653569180999117
      year => 0.8465844425179013
      :SV-15403{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004, 0.02:0.002, 0.03:0.001, 0.046:0.0
      Top Terms:
      said => 1.6493270418577013
      mln => 1.221708475489808
      dlrs => 1.0983489300320377
      3 => 1.0370024996153944
      pct => 0.9967446058994232
      reuter => 0.993528974793619
      its => 0.9558988111209523
      from => 0.9454911460774864
      vs => 0.8633642497287671
      year => 0.8505083085439775
      :SV-15514{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004, 0.02:0.002, 0.03:0.001, 0.046:0.0
      Top Terms:
      said => 1.6414524586689534
      mln => 1.2179029815366167
      dlrs => 1.094218299808865
      3 => 1.033773769117182
      pct => 0.9920102286561391
      reuter => 0.9903676795676004
      its => 0.9513191861395162
      from => 0.9408515920762511
      vs => 0.865304353452142
      year => 0.8467337135094862
      :SV-15549{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004, 0.02:0.002, 0.03:0.001, 0.046:0.0
      Top Terms:
      said => 1.640632892454694
      mln => 1.2174764812983898
      dlrs => 1.0937717467869699
      3 => 1.033424727632325
      pct => 0.99151691360307
      reuter => 0.9900253758026865
      its => 0.9508415534060888
      from => 0.9403654699584985
      vs => 0.865436402399392
      year => 0.8463303217162843
      :SV-15616{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004, 0.02:0.002, 0.03:0.001, 0.046:0.0
      Top Terms:
      said => 1.6402745961421197
      mln => 1.217287104215781
      dlrs => 1.0935749393200054
      3 => 1.0332709291683844
      pct => 0.9913012005612369
      reuter => 0.9898744911012118
      its => 0.9506326562835085
      from => 0.9401525895225771
      vs => 0.8654873596392523
      year => 0.8461528918952358
      :SV-15674{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004, 0.02:0.002, 0.03:0.001, 0.046:0.0
      Top Terms:
      said => 1.6402335213893247
      mln => 1.2172651791725515
      dlrs => 1.0935522610806727
      3 => 1.0332532137000938
      pct => 0.991276468108388
      reuter => 0.9898571070574692
      its => 0.9506087026962596
      from => 0.9401281555632803
      vs => 0.8654927058873914
      year => 0.8461324681573653
      :SV-15720{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004, 0.02:0.002, 0.03:0.001, 0.046:0.0
      Top Terms:
      said => 1.641454220566282
      mln => 1.2179063418879368
      dlrs => 1.0942205822099829
      3 => 1.0337754035575257
      pct => 0.9920113271819195
      reuter => 0.9903693325123661
      its => 0.9513202705619623
      from => 0.9408530174807668
      vs => 0.8653096216062077
      year => 0.8467355860669477
      :SV-15732{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004, 0.02:0.002, 0.03:0.001, 0.046:0.0
      Top Terms:
      said => 1.6418679366988789
      mln => 1.218118262616823
      dlrs => 1.0944441677361394
      3 => 1.0339502052648608
      pct => 0.9922602967957669
      reuter => 0.9905406967751569
      its => 0.9515612774046113
      from => 0.941098001639954
      vs => 0.865235154416334
      year => 0.8469379811534101
      :SV-15825{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004, 0.02:0.002, 0.03:0.001, 0.046:0.0
      Top Terms:
      said => 1.6403540331112847
      mln => 1.2173302824011656
      dlrs => 1.0936192179118565
      3 => 1.0333054698476525
      pct => 0.9913490440255205
      reuter => 0.9899084014354236
      its => 0.9506790000021428
      from => 0.9401999656754023
      vs => 0.8654787849286104
      year => 0.8461927112339609
      :SV-15888{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004, 0.02:0.002, 0.03:0.001, 0.046:0.0
      Top Terms:
      said => 1.641852069569193
      mln => 1.218106579705691
      dlrs => 1.0944336674208315
      3 => 1.0339422184421034
      pct => 0.9922506923700831
      reuter => 0.9905327937543529
      its => 0.951551949990525
      from => 0.9410880514065464
      vs => 0.8652299423273659
      year => 0.8469287549740471
      :SV-15944{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004, 0.02:0.002, 0.03:0.001, 0.046:0.0
      Top Terms:
      said => 1.6406094746503062
      mln => 1.2174640910103491
      dlrs => 1.0937588768380255
      3 => 1.0334146735611798
      pct => 0.9915028147402405
      reuter => 0.9900155118531778
      its => 0.9508279001565995
      from => 0.9403515526055797
      vs => 0.865439705916966
      year => 0.846318717539638
      :SV-15952{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004, 0.02:0.002, 0.03:0.001, 0.046:0.0
      Top Terms:
      said => 1.641608350634413
      mln => 1.2179827157677379
      dlrs => 1.094302484756082
      3 => 1.033839606583586
      pct => 0.9921040410110572
      reuter => 0.990432219413613
      its => 0.9514099986904929
      from => 0.9409438763575203
      vs => 0.8652760331837802
      year => 0.8468099163160301
      :SV-15954{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004, 0.02:0.002, 0.03:0.001, 0.046:0.0
      Top Terms:
      said => 1.6429205353451672
      mln => 1.2186434984636658
      dlrs => 1.0950054459143779
      3 => 1.0343894404834142
      pct => 0.992893505149969
      reuter => 0.9909710261706427
      its => 0.9521740690117075
      from => 0.9417194634871013
      vs => 0.8650137662755684
      year => 0.8474476266423354
      :SV-16007{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004, 0.02:0.002, 0.03:0.001, 0.046:0.0
      Top Terms:
      said => 1.6401767760282457
      mln => 1.2172339691485916
      dlrs => 1.093520432998812
      3 => 1.0332284013507513
      pct => 0.9912422858233993
      reuter => 0.9898327402827573
      its => 0.9505755879363272
      from => 0.9400942591120444
      vs => 0.8654979916098049
      year => 0.8461038772989482
      :SV-16037{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004, 0.02:0.002, 0.03:0.001, 0.046:0.0
      Top Terms:
      said => 1.640610618380475
      mln => 1.2174645746382695
      dlrs => 1.0937594396319776
      3 => 1.0334151203058977
      pct => 0.9915035014016228
      reuter => 0.9900159476830741
      its => 0.9508285640147016
      from => 0.9403522136131415
      vs => 0.8654392679742507
      year => 0.846319234572972

        Activity

        Hudson added a comment -

        Integrated in Mahout-Quality #1101 (See https://builds.apache.org/job/Mahout-Quality/1101/)
        MAHOUT-766: Changed m argument to 1.1 and switched Dirichlet to use clustering vs. classifier implementation. Added cosine distance measure to reuters kmeans.

        jeastman : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1185737
        Files :

        • /mahout/trunk/examples/bin/build-reuters.sh
        • /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/DisplayDirichlet.java
        • /mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/DisplayFuzzyKMeans.java
        Jeff Eastman added a comment -

        I think the problem here is using the default distance measure (EuclideanSquared) with fuzzyk. I added

        -dm org.apache.mahout.common.distance.CosineDistanceMeasure \

        to the script and it produced clusters that differ somewhat from each other but still have a high degree of similarity in their terms and weights. Then I decreased m to 1.1 and, predictably, the clusters diverged to be more like the kmeans results.

        The clustering does seem quite sensitive to the value of m; within the range 1 < m <= 2, m has a large impact on the clusters.

        I'm going to resolve this as not a problem.
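The m sensitivity Jeff describes falls directly out of the standard fuzzy c-means membership formula, u_i = 1 / sum_j (d_i/d_j)^(2/(m-1)): as m approaches 1 the exponent 2/(m-1) blows up and memberships become nearly crisp, while m = 2 keeps them soft. A minimal sketch of that formula (illustrative only, not Mahout's implementation; the function name and toy distances are invented):

```python
def fcm_memberships(distances, m):
    """Fuzzy c-means membership of one point in each cluster.

    distances: the point's distance to each cluster center (all > 0).
    m: fuzziness exponent, m > 1. Standard FCM formula:
        u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1))
    """
    return [1.0 / sum((d_i / d_j) ** (2.0 / (m - 1.0)) for d_j in distances)
            for d_i in distances]

# A point twice as close to cluster 0 as to cluster 1:
d = [1.0, 2.0]

soft = fcm_memberships(d, m=2.0)   # [0.8, 0.2]: still quite soft
crisp = fcm_memberships(d, m=1.1)  # ~[0.999999, 9.5e-07]: nearly a hard assignment
```

With m = 2 the nearer cluster gets only 80% of the point's weight, so every centroid update still sees a sizable fraction of the whole corpus; with m = 1.1 the assignment is effectively hard, which is consistent with lowering m making the Reuters clusters diverge toward kmeans-like results.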

        Jeff Eastman added a comment -

        I can duplicate this issue; however, I am not convinced it is uncovering a defect, for the following reasons:

        • When I run clusterdump on all the clusters-x directories, I see that fuzzyk is actually converging on the reported cluster definitions. The initial clusters are significantly different, but after three iterations they have all converged upon k identical clusters.
        • Since fuzzyk assigns each point to each cluster with a weight inversely proportional to its distance from the cluster, it is expected that the clusters will overlap to some extent. You can see this by running DisplayFuzzyKmeans: with m=2 the clusters overlap significantly, while with m=1.1 they are much more disjoint and look, as advertised, more like kmeans. This seems reasonable to me.

        I don't have a lot of intuition about how fuzzyk should behave on a text clustering problem like Reuters. Pallavi and Grant have their fingerprints on the MAHOUT-74 issue which created this implementation, but many others, including me, have been in the code. Is this a defect, or just a consequence of this algorithm running with these arguments on this data?
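Jeff's second bullet can be made concrete with the FCM centroid update, c_j = sum_i u_ij^m x_i / sum_i u_ij^m. Whenever the memberships u_ij are close to uniform (which happens when each point's distances to all centers are nearly equal, as squared Euclidean distances between sparse tf-idf vectors tend to be), every centroid is essentially the same weighted mean of the whole corpus, which matches the "identical top terms" symptom reported above. A toy one-dimensional sketch with invented data, not the Mahout implementation:

```python
def fcm_centroids(points, memberships, m):
    """FCM centroid update for 1-D points:
    c_j = sum_i u_ij^m * x_i / sum_i u_ij^m"""
    k = len(memberships[0])
    return [
        sum((memberships[i][j] ** m) * points[i] for i in range(len(points)))
        / sum(memberships[i][j] ** m for i in range(len(points)))
        for j in range(k)
    ]

points = [0.0, 1.0, 2.0, 9.0, 10.0, 11.0]  # two obvious groups

# Nearly uniform memberships (the high-dimensional text regime): both
# centroids land near the grand mean (5.5), i.e. the clusters collapse.
nearly_uniform = [[0.52, 0.48] if x < 5 else [0.48, 0.52] for x in points]
collapsed = fcm_centroids(points, nearly_uniform, m=2.0)

# Crisp memberships (the m -> 1 regime) recover the two group means.
crisp = [[1.0, 0.0] if x < 5 else [0.0, 1.0] for x in points]
separated = fcm_centroids(points, crisp, m=2.0)  # [1.0, 10.0]
```

This is why identical-looking clusters can be a property of the data and parameters rather than a bug: the update rule is working as specified, but near-uniform memberships drive every center to nearly the same point.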


          People

          • Assignee: Jeff Eastman
          • Reporter: Paulo Magalhaes
          • Votes: 0
          • Watchers: 0
