Description
A partition-based dataset (a new underlying infrastructure component) was added as part of IGNITE-7437, and the decision tree algorithm now needs to be adapted to work on top of this infrastructure.
The way the decision tree algorithm is implemented on top of row-partitioned data is described below.
The basic idea behind any decision tree, for both regression and classification, is to find the data split that minimizes an impurity measure such as Gini impurity, entropy, or mean squared error. To find the best split, we need to build a function that describes the dependency between the split point (the independent variable) and the impurity measure (the dependent variable), and then find the minimum of this function.
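As a rough illustration of impurity as a function of the split point, here is a minimal single-node sketch in Python. It is not taken from the Ignite code base: the function names (`sse`, `impurity_at`, `best_split`), the single-feature `(x, y)` row layout, and the choice of the summed squared error as the impurity measure are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

Row = Tuple[float, float]  # (feature value x, label y); hypothetical layout


def sse(labels: List[float]) -> float:
    """Sum of squared errors around the mean (MSE multiplied by the row count)."""
    if not labels:
        return 0.0
    mean = sum(labels) / len(labels)
    return sum((y - mean) ** 2 for y in labels)


def impurity_at(rows: List[Row], threshold: float) -> float:
    """Impurity (dependent variable) for a given split point (independent variable)."""
    left = [y for x, y in rows if x <= threshold]
    right = [y for x, y in rows if x > threshold]
    return sse(left) + sse(right)


def best_split(rows: List[Row]) -> Optional[Tuple[float, float]]:
    """Evaluate the impurity function at every candidate threshold and return its minimum."""
    # The largest feature value is skipped: splitting there leaves the right side empty.
    candidates = sorted({x for x, _ in rows})[:-1]
    if not candidates:
        return None  # all rows share one feature value, so no split is possible
    best = min(candidates, key=lambda t: impurity_at(rows, t))
    return best, impurity_at(rows, best)
```

For example, `best_split([(1, 1.0), (2, 1.0), (3, 5.0), (4, 5.0)])` picks the threshold `2`, which separates the two label groups cleanly.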
In a distributed system where the data is partitioned by row, we can calculate this function on every node, compress it, and pass it to the master node. The master node sums the functions received from all nodes and then finds the minimum of the resulting function. This is how the decision tree algorithm is implemented in the Apache Ignite ML module.
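The per-node calculation and the summation on the master node could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the Ignite ML implementation: "compression" is modelled here as evaluating label statistics (count, sum, sum of squares) on a shared, fixed list of candidate thresholds, and all names (`local_stats`, `merge`, `best_split`) are hypothetical. These statistics are additive, so the master can sum them across partitions and recover the squared-error impurity without seeing any rows.

```python
from typing import List, Optional, Sequence, Tuple

Row = Tuple[float, float]           # (feature value x, label y); hypothetical layout
Stats = Tuple[int, float, float]    # (count, sum of labels, sum of squared labels)


def local_stats(partition: Sequence[Row], thresholds: Sequence[float]) -> List[Stats]:
    """Runs on each node: statistics of rows with x <= t for every threshold t.
    The last entry (threshold = +inf) holds the totals for the whole partition."""
    result = []
    for t in list(thresholds) + [float("inf")]:
        ys = [y for x, y in partition if x <= t]
        result.append((len(ys), float(sum(ys)), float(sum(y * y for y in ys))))
    return result


def merge(a: List[Stats], b: List[Stats]) -> List[Stats]:
    """Runs on the master: element-wise sum of two per-node functions."""
    return [(n1 + n2, s1 + s2, q1 + q2) for (n1, s1, q1), (n2, s2, q2) in zip(a, b)]


def sse(stats: Stats) -> float:
    """Sum of squared errors around the mean, recovered from additive statistics."""
    n, s, q = stats
    return q - s * s / n if n else 0.0


def best_split(per_node: List[List[Stats]],
               thresholds: Sequence[float]) -> Optional[Tuple[float, float]]:
    """Sum the functions from all nodes, then find the minimum of the result."""
    merged = per_node[0]
    for stats in per_node[1:]:
        merged = merge(merged, stats)
    total = merged[-1]
    best: Optional[Tuple[float, float]] = None
    for t, left in zip(thresholds, merged[:-1]):
        right = (total[0] - left[0], total[1] - left[1], total[2] - left[2])
        if left[0] == 0 or right[0] == 0:
            continue  # a split that leaves one side empty is not useful
        impurity = sse(left) + sse(right)
        if best is None or impurity < best[1]:
            best = (t, impurity)
    return best
```

A usage example with two row partitions: `best_split([local_stats(p1, ts), local_stats(p2, ts)], ts)` with `p1 = [(1, 1.0), (3, 5.0)]`, `p2 = [(2, 1.0), (4, 5.0)]`, and `ts = [1, 2, 3]` selects the threshold `2`, the same split a single node would find on the combined data.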
Issue Links
- causes: IGNITE-8269 Add documentation for decision tree (release 2.5) (Closed)