Mahout / MAHOUT-19

Hierarchical clusterer

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Later
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: classic
    • Labels: None

    Description

      In a hierarchical clusterer, the instances are the leaf nodes of a tree. Each branch node holds the mean features of its children and the distance between them.
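As a rough illustration of the tree described above, here is a minimal Java sketch. The class name `TreeNode`, its fields, and the pluggable distance function are hypothetical and are not taken from the attached patches. The branch mean is computed as the plain average of the two children's feature vectors, which is an assumption about what "mean features" means here.

```java
import java.util.function.ToDoubleBiFunction;

/** Hypothetical binary tree node: leaves hold instances, branches hold summaries. */
final class TreeNode {
    final double[] features;     // instance features (leaf) or mean of the children's features (branch)
    final double childDistance;  // distance between the two children; 0 for a leaf
    final TreeNode left;         // null for a leaf
    final TreeNode right;        // null for a leaf

    private TreeNode(double[] features, double childDistance, TreeNode left, TreeNode right) {
        this.features = features;
        this.childDistance = childDistance;
        this.left = left;
        this.right = right;
    }

    /** A leaf wraps a single instance. */
    static TreeNode leaf(double[] features) {
        return new TreeNode(features.clone(), 0.0, null, null);
    }

    /** A branch stores the unweighted mean of its children and the distance between them. */
    static TreeNode branch(TreeNode l, TreeNode r, ToDoubleBiFunction<double[], double[]> distance) {
        double[] mean = new double[l.features.length];
        for (int i = 0; i < mean.length; i++) {
            mean[i] = (l.features[i] + r.features[i]) / 2.0;
        }
        return new TreeNode(mean, distance.applyAsDouble(l.features, r.features), l, r);
    }

    boolean isLeaf() {
        return left == null;
    }
}
```

The distance is passed in as a function so that any measure can be plugged in without changing the node type.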

      For performance reasons I have always trained trees top-down. I have been told that this can cause various effects, though I have never encountered them. And I believe Huffman solved his problem by training bottom-up? The thing is, I don't think it is possible to train the tree top-down using MapReduce. I do, however, think it is possible to train it bottom-up. I would very much appreciate any thoughts on this.
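To make "bottom-up" concrete, here is a naive single-threaded Java sketch that starts with one node per instance and repeatedly merges the two closest nodes until a single root remains. It is not the patch attached to this issue and says nothing about how the work would be split across MapReduce jobs. The names `BottomUpSketch`, `Node`, and `train` are hypothetical, and Euclidean distance stands in for whatever measure is actually used.

```java
import java.util.ArrayList;
import java.util.List;

/** Naive agglomerative (bottom-up) tree building; roughly cubic in the number of instances. */
final class BottomUpSketch {
    static final class Node {
        final double[] mean;    // instance features (leaf) or mean of the two children (branch)
        final double distance;  // distance between the children; 0 for a leaf
        final Node left, right; // null for a leaf

        Node(double[] mean, double distance, Node left, Node right) {
            this.mean = mean;
            this.distance = distance;
            this.left = left;
            this.right = right;
        }
    }

    // Stand-in distance measure (assumption); any other measure could be substituted.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    static Node train(List<double[]> instances) {
        if (instances.isEmpty()) {
            throw new IllegalArgumentException("no instances");
        }
        List<Node> roots = new ArrayList<>();
        for (double[] x : instances) {
            roots.add(new Node(x.clone(), 0.0, null, null));
        }
        while (roots.size() > 1) {
            // Find the closest pair among the current subtree roots.
            int bestI = 0, bestJ = 1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < roots.size(); i++) {
                for (int j = i + 1; j < roots.size(); j++) {
                    double d = euclidean(roots.get(i).mean, roots.get(j).mean);
                    if (d < best) {
                        best = d;
                        bestI = i;
                        bestJ = j;
                    }
                }
            }
            Node l = roots.get(bestI);
            Node r = roots.get(bestJ);
            double[] mean = new double[l.mean.length];
            for (int k = 0; k < mean.length; k++) {
                mean[k] = (l.mean[k] + r.mean[k]) / 2.0;
            }
            roots.remove(bestJ); // remove the higher index first so bestI stays valid
            roots.remove(bestI);
            roots.add(new Node(mean, best, l, r));
        }
        return roots.get(0);
    }
}
```

The pair search inside the loop is the expensive part, and it is also the part that would have to be distributed in any MapReduce formulation.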

      Once this tree is trained, one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search of a cluster.
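One possible reading of this extraction step, sketched in Java: compute the mean pairwise distance over the instances, then walk down from the root and treat every subtree whose branch distance is within that limit as one cluster. This is only one of the "various ways" and is not necessarily the approach used in the attached patches. The names `ClusterExtractionSketch`, `Node`, `extract`, and `meanPairwiseDistance` are hypothetical, and Euclidean distance is again a stand-in.

```java
import java.util.ArrayList;
import java.util.List;

/** Cuts a trained tree into clusters using a maximum allowed branch distance. */
final class ClusterExtractionSketch {
    static final class Node {
        final String label;     // instance id; non-null for leaves only
        final double distance;  // distance between the children; 0 for a leaf
        final Node left, right; // null for a leaf

        private Node(String label, double distance, Node left, Node right) {
            this.label = label;
            this.distance = distance;
            this.left = left;
            this.right = right;
        }

        static Node leaf(String label) {
            return new Node(label, 0.0, null, null);
        }

        static Node branch(Node left, Node right, double distance) {
            return new Node(null, distance, left, right);
        }
    }

    /** Mean Euclidean distance over all unordered pairs of points (0 if fewer than two points). */
    static double meanPairwiseDistance(double[][] points) {
        double sum = 0.0;
        int pairs = 0;
        for (int i = 0; i < points.length; i++) {
            for (int j = i + 1; j < points.length; j++) {
                double sq = 0.0;
                for (int k = 0; k < points[i].length; k++) {
                    double diff = points[i][k] - points[j][k];
                    sq += diff * diff;
                }
                sum += Math.sqrt(sq);
                pairs++;
            }
        }
        return pairs == 0 ? 0.0 : sum / pairs;
    }

    /** Returns the leaf labels of each subtree whose branch distance is within maxDistance. */
    static List<List<String>> extract(Node root, double maxDistance) {
        List<List<String>> clusters = new ArrayList<>();
        cut(root, maxDistance, clusters);
        return clusters;
    }

    private static void cut(Node n, double maxDistance, List<List<String>> out) {
        // Only the distance at this node is checked; deeper branches are not re-examined.
        if (n.left == null || n.distance <= maxDistance) {
            List<String> members = new ArrayList<>();
            collect(n, members);
            out.add(members);
        } else {
            cut(n.left, maxDistance, out);
            cut(n.right, maxDistance, out);
        }
    }

    private static void collect(Node n, List<String> members) {
        if (n.left == null) {
            members.add(n.label);
        } else {
            collect(n.left, members);
            collect(n.right, members);
        }
    }
}
```

Because the walk only touches nodes above the cut, changing `maxDistance` and extracting again is cheap when the tree is held in memory.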

      Navigating the tree and gathering nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to gain from extracting all clusters up front. Also, it usually makes sense to let the user modify the cluster boundary variables in real time using a slider, or perhaps to present the named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not too bad to use secondary classification on the instances to create wormholes in the tree. I always thought it would be cool to visualize it using Touchgraph.

      My focus is on clustering text documents to provide an instant "more like this" feature in search engines, using Tanimoto similarity on the vector spaces to calculate the distance.
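For reference, the Tanimoto similarity of two weighted vectors a and b is a·b / (|a|² + |b|² − a·b). A minimal Java sketch over sparse term-weight maps follows. The class name `Tanimoto` and the map-based document representation are assumptions made for illustration, as is the choice to derive a distance as 1 minus the similarity.

```java
import java.util.Map;

/** Tanimoto similarity over sparse term-weight vectors (term -> weight). */
final class Tanimoto {
    private Tanimoto() {}

    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            double w = e.getValue();
            normA += w * w;
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += w * other; // only terms present in both vectors contribute
            }
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        double denominator = normA + normB - dot;
        // Assumption: two empty (all-zero) vectors are treated as identical.
        return denominator == 0.0 ? 1.0 : dot / denominator;
    }

    /** Distance derived from the similarity; 0 for identical vectors, 1 for disjoint ones. */
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        return 1.0 - similarity(a, b);
    }
}
```

With non-negative term weights the similarity lies between 0 and 1, so the derived distance does too.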

      See LUCENE-1025 for a single-threaded, all-in-memory proof of concept of a hierarchical clusterer.

      Attachments

        1. MAHOUT-19_20100319.diff
          50 kB
          Karl Wettin
        2. MAHOUT-19.txt
          235 kB
          Karl Wettin
        3. MAHOUT-19.txt
          241 kB
          Karl Wettin
        4. MAHOUT-19.txt
          262 kB
          Karl Wettin
        5. MAHOUT-19.txt
          111 kB
          Karl Wettin
        6. TestBottomFeed.test.png
          685 kB
          Karl Wettin
        7. TestTopFeed.test.png
          652 kB
          Karl Wettin
        8. MAHOUT-19.txt
          205 kB
          Karl Wettin

            People

              Assignee: Karl Wettin
              Reporter: Karl Wettin
              Votes: 0
              Watchers: 0
