Uploaded image for project: 'Mahout'
  1. Mahout
  2. MAHOUT-121

Speed up distance calculations for sparse vectors

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 0.2
    • classic
    • None

    Description

      From my mail to the Mahout mailing list.

      I am working on clustering a dataset which has thousands of sparse vectors. The complete dataset has few tens of thousands of feature items but each vector has only couple of hundred feature items. For this, there is an optimization in distance calculation, a link to which I found the archives of Mahout mailing list.

      http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/

      I tried out this optimization. The test setup had 2000 document vectors with few hundred items. I ran canopy generation with Euclidean distance and t1, t2 values as 250 and 200.

      Current Canopy Generation: 28 min 15 sec.
      Canopy Generation with distance optimization: 1 min 38 sec.

      I know by experience that using Integer, Double objects instead of primitives is computationally expensive. I changed the sparse vector implementation to used primitive collections by Trove [
      http://trove4j.sourceforge.net/ ].

      Distance optimization with Trove: 59 sec
      Current canopy generation with Trove: 21 min 55 sec

      To sum, these two optimizations reduced cluster generation time by a 97%.

      Currently, I have made the changes for Euclidean Distance, Canopy and KMeans.

      Licensing of Trove seems to be an issue which needs to be addressed.

      Attachments

        1. MAHOUT-121-new-distance-optimization.patch
          5 kB
          Nicolás Fantone
        2. MAHOUT-121-distance-optimization.patch
          5 kB
          Nicolás Fantone
        3. MAHOUT-121-cluster-distance.patch
          1 kB
          Grant Ingersoll
        4. doc-vector-4k
          4.74 MB
          Shashikant Kore
        5. mahout-121.patch
          16 kB
          Grant Ingersoll
        6. Canopy_Wiki_1000-2009-06-24.snapshot
          1.78 MB
          Grant Ingersoll
        7. MAHOUT-121.patch
          113 kB
          Sean R. Owen
        8. MAHOUT-121.patch
          111 kB
          Sean R. Owen
        9. MAHOUT-121jfe.patch
          109 kB
          Jeff Eastman
        10. MAHOUT-121.patch
          108 kB
          Sean R. Owen
        11. MAHOUT-121.patch
          107 kB
          Sean R. Owen
        12. MAHOUT-121.patch
          9 kB
          Grant Ingersoll
        13. Mahout1211.patch
          9 kB
          Sean R. Owen
        14. mahout-121.patch
          21 kB
          Shashikant Kore

        Issue Links

          Activity

            People

              gsingers Grant Ingersoll
              kshashi Shashikant Kore
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: