[IGNITE-8059] Integrate decision tree with partition based dataset - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: 2.5
Component/s: ml
Labels: None

Description

A partition-based dataset (a new underlying infrastructure component) was added as part of IGNITE-7437, and now we need to adapt the decision tree algorithm to work on top of this infrastructure.

The way the decision tree algorithm is implemented on top of row-partitioned data is described below.

First, the basic idea behind any decision tree, for both regression and classification, is to find the data split that minimizes an impurity measure such as the Gini coefficient, entropy, or mean squared error. To find the best split, we build a function that describes the dependency between the split point (independent variable) and the impurity measure (dependent variable), and then find the minimum of this function.
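As an illustration of the idea above, here is a minimal single-node sketch in Python. It is not the Ignite ML code: the function names (`gini`, `split_impurity`, `best_split`), the choice of Gini impurity, and the use of observed feature values as candidate split points are all assumptions made for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(xs, ys, threshold):
    """Impurity as a function of the split point: the size-weighted
    Gini impurity of the two sides produced by `x <= threshold`."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    n = len(ys)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_split(xs, ys):
    """Return the candidate threshold with minimal impurity, or None
    when the feature has fewer than two distinct values."""
    # Assumption: candidates are the observed values except the largest,
    # so both sides of every candidate split are non-empty.
    candidates = sorted(set(xs))[:-1]
    return min(candidates, key=lambda t: split_impurity(xs, ys, t), default=None)
```

For a feature with values `[1, 2, 3, 4]` and labels `[0, 0, 1, 1]`, the function `split_impurity` reaches zero at the threshold that separates the two classes, and `best_split` returns that threshold.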

In a distributed system where the data is partitioned by row, we can calculate this function on every node, compress it in some form, and pass it to the master node. The master node sums the functions received from all nodes and then finds the minimum of the resulting function. This is how the decision tree algorithm is implemented in the Apache Ignite ML module.
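The scheme above can be sketched in Python as follows. This is a hedged illustration, not the Ignite ML implementation. It assumes that every partition evaluates the same fixed list of candidate thresholds, and that the "compressed function" is a table of per-threshold class counts, which can be added across partitions. The names `local_stats`, `merge_stats`, and `best_split_from_stats` are hypothetical.

```python
from collections import Counter
from functools import reduce


def local_stats(rows, thresholds):
    """Runs on one partition. `rows` is a list of (x, label) pairs.
    Returns per-threshold label counts for the left side, plus totals."""
    total = Counter(label for _, label in rows)
    left = [Counter(label for x, label in rows if x <= t) for t in thresholds]
    return left, total


def merge_stats(a, b):
    """Runs on the master. Sums two partitions' statistics element-wise."""
    left_a, total_a = a
    left_b, total_b = b
    return [la + lb for la, lb in zip(left_a, left_b)], total_a + total_b


def gini_from_counts(counts):
    """Gini impurity computed from a Counter of class counts."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split_from_stats(stats, thresholds):
    """Find the threshold minimizing the weighted Gini impurity
    using only the merged counts, never the raw rows."""
    left, total = stats
    n = sum(total.values())
    best_t, best_imp = None, float("inf")
    for t, l in zip(thresholds, left):
        r = total - l  # right-side counts follow from totals
        nl, nr = sum(l.values()), sum(r.values())
        if nl == 0 or nr == 0:
            continue  # skip degenerate splits with an empty side
        imp = nl / n * gini_from_counts(l) + nr / n * gini_from_counts(r)
        if imp < best_imp:
            best_t, best_imp = t, imp
    return best_t, best_imp


def distributed_best_split(partitions, thresholds):
    """Simulates the map (per partition) and reduce (master) steps locally."""
    per_partition = [local_stats(rows, thresholds) for rows in partitions]
    return best_split_from_stats(reduce(merge_stats, per_partition), thresholds)
```

Only the count tables travel to the master, so the amount of data transferred depends on the number of candidate thresholds and classes rather than on the number of rows in each partition.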