SystemDS / SYSTEMDS-493

Modularize Existing DML Algorithms


Details


    Description

      Currently, our provided DML algorithms come in the form of single, long scripts that contain the read and write statements, are usually not broken up into modular UDFs, and require the user to supply all arguments via the command line or bash scripts. As a high-level example:

      // read statements, parameter parsing, etc.
      X = read(...)
      hyperparam1 = $1
      anotherHyperparam = $2
      ...
      
      // core part of the algorithm
      // note: this is not broken up into a UDF, and is instead just a continuation of the script
      while(!converged) {
       // do stuff
      }
      
      // outputs, test results, stats, etc
      write(...)
      print(...)
      

      The issue here is that many ML algorithms require hyperparameter tuning, and are part of a general data flow (data ingestion, cleaning, splitting, etc.). Due to this, it would be ideal if our algorithm scripts were modularized so that the core parts of the algorithms were wrapped in UDFs (i.e. train(...), test(...), etc.). Then, rather than having to perform these additional steps from a bash script, a user could instead import our algorithm scripts from DML, and make calls to the UDFs as necessary. As an example of the modification to our scripts:

      // read statements, parameter parsing, etc.
      X = read(...)
      hyperparam1 = $1
      anotherHyperparam = $2
      ...
      
      // core part of the algorithm
      // note: this is wrapped in a UDF, thus allowing the user to import it and supply arguments from another DML script if desired
      train = function (matrix[double] X, double hyperparam1, double hyperparam2) return (matrix[double] model) {
          while(!converged) {
           // do stuff
          }
      }
      
      // when run as a script, this will invoke the `train(...)` function, thus achieving the same result as the previous script design
      model = train(X, hyperparam1, anotherHyperparam)
      
      // outputs, test results, stats, etc
      write(...)
      print(...)
      

      By modularizing the core parts of the algorithms into UDFs while still keeping the surrounding read/write statements, our provided scripts can continue to be executed as standalone scripts in the (currently) normal fashion, while also being importable from other DML scripts so that the UDFs can be used directly. As an example of a custom DML workflow script:

      // import
      source("LinearReg.dml") as lr
      // ingest data
      X_dirty = read(...)
      
      // clean data
      X = ...
      
      // split
      X_train = ...
      X_val = ...
      X_test = ...
      
      // hyperparameter tuning
      while(tuning) {
          hyperparam1 = ...
          hyperparam2 = ...
          model = lr::train(X_train, hyperparam1, hyperparam2)
          error = lr::test(X_val, ...)
          ...
      }
      
      // use best hyperparameters
      ...
      
      // save model
      write(model)
      

      This change could be applied to all of our provided DML algorithms, and many could be broken up into train(...), test(...), stats(...), etc. functions. The goal here is to promote the use of DML for the entire ML pipeline (i.e. the way Python, R, Scala, etc. are currently being used), rather than encouraging the use of cumbersome bash scripts.
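      To make this concrete, a fully modularized algorithm script could look like the sketch below. This is illustrative only: the function names (train, test), their argument lists, the simple gradient-descent update, and the named command-line arguments ($X, $y, etc.) are hypothetical placeholders, not the actual contents of any existing algorithm script.

      # IllustrativeLinearReg.dml -- sketch of the proposed structure
      train = function (matrix[double] X, matrix[double] y, double reg, int max_iter)
          return (matrix[double] model) {
        # simplified gradient descent on regularized squared loss
        model = matrix(0, rows=ncol(X), cols=1)
        for (i in 1:max_iter) {
          grad = t(X) %*% (X %*% model - y) + reg * model
          model = model - 0.0001 * grad
        }
      }

      test = function (matrix[double] X, matrix[double] y, matrix[double] model)
          return (double mse) {
        # mean squared error on held-out data
        residuals = X %*% model - y
        mse = sum(residuals ^ 2) / nrow(X)
      }

      # script-mode entry point: read inputs, train, write outputs,
      # so the script still runs standalone from the command line
      X = read($X)
      y = read($y)
      model = train(X, y, $reg, $max_iter)
      write(model, $model)

      Another DML script could then source this file and call only train(...) or test(...), skipping the read/write entry point entirely.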

      People

        Assignee: Unassigned
        Reporter: Mike Dusenberry (dusenberrymw)