Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Later
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Clustering
    • Labels:
      None

      Description

      In a hierarchical clusterer the instances are the leaf nodes in a tree, where branch nodes contain the mean features of, and the distance between, their children.

      For performance reasons I have always trained trees top-down. I have been told that this can cause various effects I never encountered, and I believe Huffman solved his problem by training bottom-up? The thing is, I don't think it is possible to train the tree top-down using MapReduce. I do, however, think it is possible to train it bottom-up. I would very much appreciate any thoughts on this.

      Once this tree is trained, one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search of a cluster.

      Navigating the tree and gathering nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to gain from extracting all clusters up front. Also, it usually makes sense to allow the user to modify the cluster boundary variables in real time using a slider, or perhaps to present a named summary of neighbouring clusters, blacklist paths in the tree, etc. It is also not too bad to use secondary classification on the instances to create wormholes in the tree. I always thought it would be cool to visualize it using Touchgraph.
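
      Sketched in (hypothetical) Java, the tree and this kind of threshold-based extraction could look roughly like the following; none of the names below are from the actual patch:

        import java.util.ArrayList;
        import java.util.List;

        // Hypothetical sketch of the cluster tree described above.
        abstract class Node {
          Branch parent;
          double[] mean; // mean feature vector of the subtree
        }

        class Leaf extends Node { } // wraps one instance; mean == instance vector

        class Branch extends Node {
          Node left, right;
          double childDistance; // distance between the two children
        }

        class ClusterExtractor {
          // "More like this": climb from the instance's leaf while siblings stay
          // within maxDistance, then return every leaf under that ancestor.
          static List<Leaf> cluster(Leaf instance, double maxDistance) {
            Node top = instance;
            while (top.parent != null && top.parent.childDistance <= maxDistance) {
              top = top.parent;
            }
            List<Leaf> result = new ArrayList<Leaf>();
            collectLeaves(top, result);
            return result;
          }

          private static void collectLeaves(Node node, List<Leaf> out) {
            if (node instanceof Leaf) {
              out.add((Leaf) node);
            } else {
              Branch b = (Branch) node;
              collectLeaves(b.left, out);
              collectLeaves(b.right, out);
            }
          }
        }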

      My focus is on clustering text documents for an instant "more like this" feature in search engines, using Tanimoto similarity on the vector spaces to calculate the distance.
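
      For reference, the Tanimoto (extended Jaccard) distance over real-valued vectors is commonly computed as 1 - (a . b) / (|a|^2 + |b|^2 - a . b); a minimal sketch:

        // Tanimoto (extended Jaccard) distance between two dense vectors.
        class Tanimoto {
          static double distance(double[] a, double[] b) {
            double dot = 0, normA = 0, normB = 0;
            for (int i = 0; i < a.length; i++) {
              dot += a[i] * b[i];
              normA += a[i] * a[i];
              normB += b[i] * b[i];
            }
            double denominator = normA + normB - dot;
            if (denominator == 0) {
              return 0; // both vectors are all zeros; treat them as identical
            }
            return 1.0 - dot / denominator;
          }
        }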

      See LUCENE-1025 for a single-threaded, all-in-memory proof of concept of a hierarchical clusterer.

      1. MAHOUT-19.txt
        205 kB
        Karl Wettin
      2. TestTopFeed.test.png
        652 kB
        Karl Wettin
      3. TestBottomFeed.test.png
        685 kB
        Karl Wettin
      4. MAHOUT-19.txt
        111 kB
        Karl Wettin
      5. MAHOUT-19.txt
        262 kB
        Karl Wettin
      6. MAHOUT-19.txt
        241 kB
        Karl Wettin
      7. MAHOUT-19.txt
        235 kB
        Karl Wettin
      8. MAHOUT-19_20100319.diff
        50 kB
        Karl Wettin

        Issue Links

          Activity

          Sean Owen made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Karl Wettin made changes -
          Attachment MAHOUT-19_20100319.diff [ 12439321 ]
          Karl Wettin added a comment -

          Working but non-scalable, non-thread-safe, and potentially RAM-hogging implementation.

          This is a refactored version of something I've been using for a while. I don't think any bugs managed to get in there; at least the test case passes.

          It is (as always) in dire need of more documentation, but I think the very small and simple test case might be enough if you already know when a hierarchical clusterer makes more sense. I'll try to add some this weekend, but I'm busier than ever right now. I have been missing all discussions on the board for a long time, so I'm not going to commit anything unless you tell me this is something you want me to commit.

          Sean Owen made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Later [ 7 ]
          Sean Owen added a comment -

          Sounds like this is something that should be marked "Later"? It sounds mostly done but not 100% complete, and perhaps now is not the time to finish and maintain it?

          Hide
          Karl Wettin added a comment -

          The code should work; it is, however, in need of some sort of persistence layer to store the tree. It could run in memory, but that would not allow for a very big tree. If I'm not mistaken, there is code in the patch that stores the tree in a BDB JE store.

          Hide
          Sean Owen added a comment -

          Looks like there was a lot of work done here that never got committed?

          Same question, in the name of housekeeping: should this be committed or shelved? We should commit if it's reasonably complete, has some tests, and someone might reasonably maintain it over time.

          Hide
          Karl Wettin added a comment -

          1. Assuming you are training the tree top-down, what is the division criterion you are using?

          The distributed code uses bottom-up training; one could, however, add some code that trains it top-down.

          2. How well does it scale?

          It scales OK; the tree grows rather large, though.

          3. Was the data on which this was tried sparse?

          I've tried a bit of everything. My aim is to produce a tree representing a Lucene index in order to supply instant "more like this" results. Such data is definitely sparse.

          4. What is the distance metric that has been used?

          It can use any of the metrics in Mahout. For my text tests I use Tanimoto (Jaccard).

          Basically I have a use case wherein I have a set of 5-10 million URLs, which have an inherent hierarchical relationship, and a set of user clicks on them. I would like to cluster them in a tree and use the model to answer near-neighborhood-type queries, i.e. which URLs are related to which other URLs. I did implement a sequential bottom-up hierarchical clustering algorithm, but the complexity is too bad for my data set. I then thought about implementing a top-down hierarchical clustering algorithm using the Jaccard coefficient as my distance measure and came across this patch. Can you suggest if this patch will help?

          The tree will probably be huge, but disk space is cheap. There are a few ways of improving speed; one could, for instance, find the closest clusters and navigate to the closest leaf rather than iterating over all leaves. That should speed things up a lot: I've calculated it to be something like a factor of 10 on text data. There is, however, no support for that right now, but I think there are some comments about it in the code.
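
          One possible reading of that speed-up, as a rough sketch (Node, Branch, Leaf, and Tanimoto refer to the hypothetical sketches in the description; note that a greedy descent like this is an approximation and is not guaranteed to find the globally closest leaf):

            // Greedy descent: instead of comparing the new instance against every
            // leaf, follow the closer child mean at each branch. This costs
            // O(depth) distance computations instead of O(number of leaves).
            class GreedyDescent {
              static Leaf closestLeaf(Node root, double[] instance) {
                Node current = root;
                while (current instanceof Branch) {
                  Branch b = (Branch) current;
                  double dLeft = Tanimoto.distance(instance, b.left.mean);
                  double dRight = Tanimoto.distance(instance, b.right.mean);
                  current = dLeft <= dRight ? b.left : b.right;
                }
                return (Leaf) current;
              }
            }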

          Whether 5-10 million instances will take a long time to train, I do not know. But once trained, it will only take a few milliseconds to extract a cluster for a given trained instance.

          There is a bug in this code: the tree will only grow in one child of the root. Not a big problem, and it should be rather easy to fix too. I just need to find some time to trace it out.

          Another problem is the tree persistence. Currently it uses PersistentHashMap, something I hacked up for this patch. There is also a contrib module that uses BerkeleyDB JE, which is much nicer. PersistentHashMap has actually been factored out, improved, and posted as a brand new project called BananaDB, available in Apache Labs:

          http://svn.apache.org/repos/asf/labs/bananadb/

          The BerkeleyDB dependency (read: the Sleepycat License) might not be a problem anymore as we are moving towards Maven. We should probably talk about that on the dev list.

          Hide
          Ankur added a comment -

          Hi Karl, welcome back!
          Can you share the following few things about this patch?

          1. Assuming you are training the tree top-down, what is the division criterion you are using?
          2. How well does it scale?
          3. Was the data on which this was tried sparse?
          4. What is the distance metric that has been used?

          Basically I have a use case wherein I have a set of 5-10 million URLs, which have an inherent hierarchical relationship, and a set of user clicks on them. I would like to cluster them in a tree and use the model to answer near-neighborhood-type queries, i.e. which URLs are related to which other URLs. I did implement a sequential bottom-up hierarchical clustering algorithm, but the complexity is too bad for my data set. I then thought about implementing a top-down hierarchical clustering algorithm using the Jaccard coefficient as my distance measure and came across this patch.

          Can you suggest if this patch will help?

          Grant Ingersoll made changes -
          Workflow jira [ 12427689 ] no-reopen-closed, patch-avail [ 12444693 ]
          Karl Wettin made changes -
          Attachment MAHOUT-19.txt [ 12389237 ]
          Karl Wettin added a comment -

          Rewrote most of the ugly code from scratch, added some tree integrity checks, and simplified a bunch of things.

          I'll try to get this hooked up with 20newsgroups now.

          Karl Wettin made changes -
          Link This issue is blocked by HADOOP-3977 [ HADOOP-3977 ]
          Karl Wettin made changes -
          Attachment MAHOUT-19.txt [ 12388678 ]
          Karl Wettin added a comment -

          This patch depends on Hadoop 0.19 (trunk) and JDK 1.6

          Primary local-fs storage of the cluster tree is PersistentMap, a transactionless, quick-and-dirty Map<Writable, Writable> that uses RandomAccessFile and consumes no RAM no matter the size of the map. There is also support for BDB JE in the contrib/jeTree package; this is, however, not yet handled by Ant/Maven. Still looking for a better alternative.
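
          As an illustration of the idea (not the actual PersistentMap from the patch), a much-simplified append-only map over RandomAccessFile could look like this; it trades lookup speed for zero RAM by scanning the file on every get:

            import java.io.IOException;
            import java.io.RandomAccessFile;
            import java.util.Arrays;

            // Append-only key/value file: records are (keyLength, keyBytes,
            // valueLength, valueBytes). get() scans the whole file, so nothing
            // is held in RAM regardless of map size; the last record written
            // for a key wins.
            class TinyPersistentMap {
              private final RandomAccessFile file;

              TinyPersistentMap(String path) throws IOException {
                this.file = new RandomAccessFile(path, "rw");
              }

              synchronized void put(byte[] key, byte[] value) throws IOException {
                file.seek(file.length()); // append at the end
                file.writeInt(key.length);
                file.write(key);
                file.writeInt(value.length);
                file.write(value);
              }

              synchronized byte[] get(byte[] key) throws IOException {
                byte[] result = null;
                file.seek(0);
                while (file.getFilePointer() < file.length()) {
                  byte[] k = new byte[file.readInt()];
                  file.readFully(k);
                  byte[] v = new byte[file.readInt()];
                  file.readFully(v);
                  if (Arrays.equals(k, key)) {
                    result = v;
                  }
                }
                return result;
              }
            }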

          Hide
          Karl Wettin added a comment -

          This is just a comment to say that I've been really busy and will be for another week or two. Then I'll have some time to clean this up and commit it.

          Karl Wettin made changes -
          Attachment MAHOUT-19.txt [ 12380731 ]
          Karl Wettin added a comment -

          This works now. It just needs somewhat better tests, and the tree->dot Graphviz code needs to be fixed again before I want to commit anything. I also want to try out those training optimization strategies I've written about earlier (find the closest cluster or leaf node first and then the closest instance), and I have a few small todos.

          It uses my quick-and-dirty PersistentMap, a Map<Writable, Writable> that keeps all data on disk at all times using RandomAccessFile (local storage, not DFS), but it will probably be replaced by the BSD-licensed jdbm.sourceforge.net that Andrzej pointed out.

          Hide
          Karl Wettin added a comment -

          This is doing quite OK now. It needs to add quite a number of instances before it makes sense to distribute the calculations. A leaf can now contain many instances if they are similar enough, as simple pruning to avoid super-deep trees. There are a couple of optimizable todos in the code.

          All that's left for it to be usable is persisting the tree so other applications can access it.

          This is a rather slow algorithm as the tree grows big. Perhaps it could work to keep track of all clusters, calculate their mean instances, look for the n closest mean cluster instances, and in a second iteration look for the closest instances available. But I don't know; this might have similar effects as top-feeding the tree.

          I'm also not sure if I bottom-feed the best way right now. Once the closest instance is found, I look for the closest node in the chain of parents towards the root. It might actually be good to also check whether sibling nodes along the way up are closer to the new instance.

          I don't know. I'll have to test some.

          Karl Wettin made changes -
          Attachment MAHOUT-19.txt [ 12380545 ]
          Karl Wettin added a comment -

          v2

          Hide
          Karl Wettin added a comment -

          Ok, here is the current map reduce strategy for building the tree:

          This will not work. Leaves must be inserted one at a time in the tree in order to get the second dimension correct. Perhaps it is possible to run a second job where all paired leaves are moved to the closest branch, but I don't think so; the tree will be messed up from the start. So now I'm back at the original strategy: train young trees non-distributed and then execute one job for each instance to be inserted.

          This requires appending to files on DFS, so I have hacked up an appender (open, write, append, close) until it's available in Hadoop.

          Hide
          Karl Wettin added a comment - edited

          Ok, here is the current map reduce strategy for building the tree:

          add all instances to tree as leaf nodes with no parents;
          while (tree contains more than 2 nodes with no parent) {
            create file with all permutations of nodes with no parent;
            execute job to find which node pairs are closest to each other;
            add node pairs as siblings to each other; (create new branches with node pairs as children)
            calculate mean vectors in all new branches; (another map reduce job)
          }
          place last two parentless nodes in root node;
          calculate root mean vector;
          

          Any comments on that?
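
          For illustration, a single non-distributed round of that loop amounts to pairing the two closest parentless nodes under a new branch; a rough sketch reusing the hypothetical Node/Branch/Tanimoto types from the description (the job described above would instead pair off all closest pairs per round, in parallel):

            import java.util.List;

            // One greedy pairing step of the bottom-up loop sketched above.
            class BottomUpRound {
              static void pairClosest(List<Node> parentless) {
                int bestI = -1, bestJ = -1;
                double best = Double.POSITIVE_INFINITY;
                for (int i = 0; i < parentless.size(); i++) {       // all pairs of
                  for (int j = i + 1; j < parentless.size(); j++) { // parentless nodes
                    double d = Tanimoto.distance(parentless.get(i).mean,
                                                 parentless.get(j).mean);
                    if (d < best) { best = d; bestI = i; bestJ = j; }
                  }
                }
                // create a new branch with the closest pair as children
                Branch branch = new Branch();
                branch.left = parentless.get(bestI);
                branch.right = parentless.get(bestJ);
                branch.childDistance = best;
                branch.mean = meanOf(branch.left.mean, branch.right.mean);
                branch.left.parent = branch;
                branch.right.parent = branch;
                parentless.remove(bestJ); // remove the higher index first
                parentless.remove(bestI);
                parentless.add(branch);
              }

              static double[] meanOf(double[] a, double[] b) {
                double[] m = new double[a.length];
                for (int i = 0; i < a.length; i++) {
                  m[i] = (a[i] + b[i]) / 2;
                }
                return m;
              }
            }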

          Hide
          Karl Wettin added a comment -

          This is the first real thing I've done with Hadoop. It would be great to get some input on how I have used it.

          Don't. I just figured something out.

          Hide
          Sean Owen added a comment -

          This is probably barely relevant, but I thought I would mention that the cluster-based recommenders I've used define the distance between clusters as the minimum distance between any pair of nodes, one from the first cluster and one from the other. Alternatively, you could define it as the maximum distance between any pair. Dunno, maybe that idea could translate to this situation if the distance metric is proving problematic.
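
          Those two definitions are the classic single-link and complete-link cluster distances. A quick sketch, with Tanimoto.distance from the earlier sketch standing in for whatever metric is in play:

            import java.util.List;

            // Single link: minimum pairwise distance between two clusters.
            // Complete link: maximum pairwise distance between two clusters.
            class Linkage {
              static double singleLink(List<double[]> a, List<double[]> b) {
                double min = Double.POSITIVE_INFINITY;
                for (double[] x : a) {
                  for (double[] y : b) {
                    min = Math.min(min, Tanimoto.distance(x, y));
                  }
                }
                return min;
              }

              static double completeLink(List<double[]> a, List<double[]> b) {
                double max = Double.NEGATIVE_INFINITY;
                for (double[] x : a) {
                  for (double[] y : b) {
                    max = Math.max(max, Tanimoto.distance(x, y));
                  }
                }
                return max;
              }
            }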

          Hide
          Karl Wettin added a comment -

          The bottom-feed graph actually demonstrates a few problems.

          The max distance between nodes to be included in a cluster is the mean distance to leaf-node siblings: currently 1.5 or so, due to the number of leaf nodes that have 0 distance to their sibling. It should end up at something like 40 for the best result on this data.

          Look at the yellow leaf nodes.

          If a leaf node were able to represent more than one vector when the distance between them is below a threshold, the yellow nodes would be gathered as a single child of the branch with 49,042 between siblings. Looking at the rest of the graph, I think this would end up with a mean distance of 30 or so, which results in much larger, fewer, and better clusters.

          Karl Wettin made changes -
          Link This issue is blocked by MAHOUT-20 [ MAHOUT-20 ]
          Karl Wettin added a comment -

          This is the first real thing I've done with Hadoop. It would be great to get some input on how I have used it. Pretend that DistributedBottomFeed were a driver class.

          Karl Wettin made changes -
          Field Original Value New Value
          Attachment TestTopFeed.test.png [ 12380072 ]
          Attachment TestBottomFeed.test.png [ 12380073 ]
          Attachment MAHOUT-19.txt [ 12380071 ]
          Karl Wettin added a comment -

          Implementation with non-distributed top feed and bottom feed. The latter is used by the map-reduced tree feed when the tree is too young for it to make sense to distribute. Top feed is mostly a proof of concept to show the difference.

          Patch also contains MAHOUT-20 and MAHOUT-36.

          Also attached is the Graphviz output from the generated test data I've placed in test/resources.

          There are a couple of changes and additions to the core that should be extracted to their own patches:

          • build.xml creates a job file like Nutch does.
          • Mht - reads ARFF-ish text files into a Matrix and maps ordinal values to doubles.
          • Vector#fill(double)
          • Vector#fill(double, int, int)
          • VectorStringUtils - converts a vector to a string and vice versa
          • DoubleWritable - see HADOOP-3243
          • SparseVectorWritable - accepts any vector but resolves as SparseVector

          I simulate the Hadoop framework to test DistributedBottomFeed, as all the driver code is inside the same class and will not run for obvious reasons. I'll refactor and get it running in Hadoop for real soon.

          The tree is an abstract relational storage that needs lots of refactoring.

          From Package.html:

          Feeding a tree with instances

          All nodes in the tree are either branches with two children, leaves (instances), or the root.

                root
                2,75
               /   \
            leaf1  branch1
             4,0     1,5
                    /   \
                 leaf2  leaf3
                  1,0    2,0
          

          A new instance is inserted in the tree as the child of a new branch next to the closest existing node.

          Insert new instance 1,2:

                root
                2,775
               /   \
            leaf1  branch1
             4,0     1,55
                    /   \
                  branch2 leaf3
                   1,1   2,0
                  /   \
               leaf2 leaf4
                1,0   1,2
          

          Adding an instance requires sequentially recalculating the mean value (vector space) of all the nodes from the new leaf to the root. This is a non-distributable operation. Finding where to insert the new leaf is, however. Traditionally this is done by top- or bottom-feeding the tree, i.e. finding the closest leaf node by navigating from the root towards the closest child until a leaf is reached, or by comparing against all leaves and from there navigating to the closest node.
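
          As a sketch of that insertion step, using the hypothetical types from the earlier sketches; each branch mean is taken as the unweighted mean of its two children's means, which reproduces the change from 2,75 to 2,775 at the root in the example above:

            // Insert a new instance next to the closest existing node, then
            // walk the parent chain to the root, recalculating each mean
            // sequentially: the non-distributable part.
            class BottomFeed {
              static void insert(Node closest, Leaf newLeaf) {
                Branch oldParent = closest.parent;
                Branch branch = new Branch();
                branch.left = closest;
                branch.right = newLeaf;
                branch.childDistance = Tanimoto.distance(closest.mean, newLeaf.mean);
                closest.parent = branch;
                newLeaf.parent = branch;
                if (oldParent != null) { // splice the new branch into the tree
                  if (oldParent.left == closest) {
                    oldParent.left = branch;
                  } else {
                    oldParent.right = branch;
                  }
                  branch.parent = oldParent;
                }
                // sequentially recalculate means from the new branch up to the root
                for (Branch b = branch; b != null; b = b.parent) {
                  b.mean = BottomUpRound.meanOf(b.left.mean, b.right.mean);
                }
              }
            }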

          The first distributed implementation will assume you have an infinite number of Hadoop nodes and calculate the distance to all nodes (root, branches, and leaves) in one large job in search of the closest one.

          A second implementation will spawn multiple jobs before it knows where to insert an instance:

          • find the closest leaf node
          • find the closest node between the closest leaf node and the root

          Adding instances to a young tree does not make sense to distribute! The first couple of hundred instances (or so, depending on vector size) could be bottom-fed non-distributed. (By the way, what is a good antonym to distributed?)

          Clustering

          Clustering means navigating the tree from a given node using some set of rules. Usually this means looking at the distance between the children of a node, or to the parent of a node. The default max-distance computations are based on the mean distance between leaves that are siblings of other leaves, or the mean distance to all leaf-node siblings. I'm certain there are much better solutions based on the distance to the root, squared out in various ways together with the previously mentioned values. Or find the value using classes in the training data? What algorithm could that be?

          If what to include in a cluster is set very sensitively, one could use a classifier for each cluster to allow neighbouring clusters to join with the classifying cluster, and perhaps even connect clusters that seem to contain the same class even if they are far away from each other. But that has to wait until we have classifiers.

          Persistence

          The tree is an abstract relational object storage for distances, vectors, clusters, etc., accessed via the primary keys of the nodes. Only a HashMap implementation is available.

          Karl Wettin created issue -

            People

            • Assignee:
              Karl Wettin
              Reporter:
              Karl Wettin
            • Votes:
              0
              Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved:
