SystemDS / SYSTEMDS-493

Modularize Existing DML Algorithms


Details


    Description

      Currently, our provided DML algorithms come in the form of single, long scripts that contain the read and write statements, are usually not broken up into modular UDFs, and require the user to supply all arguments via the command line or bash scripts. As a high-level example:

      // read statements, parameter parsing, etc.
      X = read(...)
      hyperparam1 = $1
      anotherHyperparam = $2
      ...
      
      // core part of the algorithm
      // note: this is not broken up into a UDF, and is instead just a continuation of the script
      while(!converged) {
       // do stuff
      }
      
      // outputs, test results, stats, etc
      write(...)
      print(...)
      

      The issue here is that many ML algorithms require hyperparameter tuning, and are part of a general data flow (data ingestion, cleaning, splitting, etc.). Due to this, it would be ideal if our algorithm scripts were modularized so that the core parts of the algorithms were wrapped in UDFs (i.e. train(...), test(...), etc.). Then, rather than having to perform these additional steps from a bash script, a user could instead import our algorithm scripts from DML, and make calls to the UDFs as necessary. As an example of the modification to our scripts:

      // read statements, parameter parsing, etc.
      X = read(...)
      hyperparam1 = $1
      anotherHyperparam = $2
      ...
      
      // core part of the algorithm
      // note: this is wrapped in a UDF, thus allowing the user to import it and supply arguments from another DML script if desired
      train = function (matrix[double] X, double hyperparam1, double hyperparam2) return (matrix[double] model) {
          while(!converged) {
           // do stuff
          }
      }
      
      // when run as a script, this will invoke the `train(...)` function, thus achieving the same result as the previous script design
      model = train(X, hyperparam1, anotherHyperparam)
      
      // outputs, test results, stats, etc
      write(...)
      print(...)
      

      By modularizing the core parts of the algorithms into UDFs while still keeping the surrounding read/write statements, our provided scripts can continue to be executed as standalone scripts in the (currently) normal fashion, while also being importable from other DML scripts so that the UDFs can be used directly. As an example of a custom DML workflow script:

      // import
      source("LinearReg.dml") as lr
      // ingest data
      X_dirty = read(...)
      
      // clean data
      X = ...
      
      // split
      X_train = ...
      X_val = ...
      X_test = ...
      
      // hyperparameter tuning
      while(tuning) {
          hyperparam1 = ...
          hyperparam2 = ...
          model = lr::train(X_train, hyperparam1, hyperparam2)
          error = lr::test(X_val, ...)
          ...
      }
      
      // use best hyperparameters
      ...
      
      // save model
      write(model)
      

      This change could be applied to all of our provided DML algorithms, and many could be broken up into train(...), test(...), stats(...), etc. functions. The goal here is to promote the use of DML for the entire ML pipeline (i.e. the way Python, R, Scala, etc. are currently being used), rather than encouraging the use of cumbersome bash scripts.
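      To make this concrete, a fully modularized algorithm script could look like the sketch below. This is illustrative only: the function names (train, test), their argument lists, the simple gradient-descent update, and the named command-line arguments ($X, $y, etc.) are hypothetical placeholders, not the actual contents of any existing algorithm script.

      # IllustrativeLinearReg.dml -- sketch of the proposed structure
      train = function (matrix[double] X, matrix[double] y, double reg, int max_iter)
          return (matrix[double] model) {
        # simplified gradient descent on regularized squared loss
        model = matrix(0, rows=ncol(X), cols=1)
        for (i in 1:max_iter) {
          grad = t(X) %*% (X %*% model - y) + reg * model
          model = model - 0.0001 * grad
        }
      }

      test = function (matrix[double] X, matrix[double] y, matrix[double] model)
          return (double mse) {
        # mean squared error on held-out data
        residuals = X %*% model - y
        mse = sum(residuals ^ 2) / nrow(X)
      }

      # script-mode entry point: read inputs, train, write outputs,
      # so the script still runs standalone from the command line
      X = read($X)
      y = read($y)
      model = train(X, y, $reg, $max_iter)
      write(model, $model)

      Another DML script could then source this file and call only train(...) or test(...), skipping the read/write entry point entirely.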

      People

        Assignee: Unassigned
        Reporter: Mike Dusenberry (dusenberrymw)