Details
Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Description
Story
`As a data scientist`
I want to call a generic PL/Python UDF from SQL to fit a model
`so that`
I can use any code I write, or any Python library, for model building.
Interface
fit(
    source_table,                -- source table
    model_table,                 -- model output table
    list_of_columns,             -- columns you want in GD, could be '*'
    list_of_columns_to_exclude,  -- columns to explicitly exclude
    fit_udf,                     -- plpython UDF to fit model
    fit_udf_parameters,          -- parameters for UDF, if any
    grouping_cols                -- groups to build separate models for (source table distributed by this grouping)
);
Arguments
source_table
TEXT. Name of the table containing the data to load.

model_table
TEXT. Name of the table containing the model(s), with one row per group.

list_of_columns
TEXT. Comma-separated string of column names or expressions to load. Can also be '*' implying all columns are to be loaded (except for the ones included in the next argument that lists exclusions). The types of the columns can be mixed. Array columns can also be included in the list and will be loaded as is (i.e., not be flattened). (???)

list_of_columns_to_exclude
TEXT. Comma-separated string of column names to exclude from load. Typically used when 'list_of_columns' is set to '*'.

fit_udf
TEXT. plpython UDF to fit model.

fit_udf_parameters (optional)
TEXT. parameters for UDF, if any

grouping_cols (optional)
TEXT, default: NULL. Comma-separated list of column names to group the data by. This will produce multiple models, one for each group.
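The interaction between list_of_columns and list_of_columns_to_exclude can be illustrated with a small sketch. This is not the module's implementation: `resolve_columns` and its `table_columns` argument are hypothetical names introduced here, and the sketch assumes plain column names only (expressions containing commas would need a real parser).

```python
def resolve_columns(table_columns, list_of_columns, list_of_columns_to_exclude=None):
    """Return the ordered list of columns to load (hypothetical helper).

    table_columns: column names of the source table, in table order.
    list_of_columns: comma-separated names, or '*' for all columns.
    list_of_columns_to_exclude: comma-separated names to drop, or None.
    """
    def split_names(text):
        # Assumption: plain names only, so a naive split on commas is enough.
        return [name.strip() for name in text.split(",") if name.strip()]

    excluded = set(split_names(list_of_columns_to_exclude or ""))
    if list_of_columns.strip() == "*":
        requested = list(table_columns)
    else:
        requested = split_names(list_of_columns)
    # Preserve the requested order while dropping the excluded names.
    return [col for col in requested if col not in excluded]


columns = resolve_columns(["id", "x1", "x2", "y"], "*", "id")
```

With '*' and an exclusion of "id", the remaining columns keep their table order.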
Open questions
1) Do we need separate fit functions for R and Python, or can we autodetect?
If we need separate ones, we could call this module `fit_plpythonu` and the R one `fit_plr`.
Notes
1) Both Keras and scikit-learn use the term `fit`, which seems better than `train`.
(We will use the term `predict` for prediction in a separate story.)
Acceptance
1) Generate a model table for a sample data set with multiple groups using a scikit-learn model.
2) Repeat for Keras/TF.
3) Repeat for XGBoost.
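As a rough illustration of the "one row per group" model table that these criteria ask for, the sketch below groups rows in plain Python and serializes each fitted result with pickle. Everything here is an assumption made for illustration: `fit_by_group`, `fit_mean`, the row layout, and the use of pickle are not part of the proposed interface, and `fit_mean` is only a stand-in for a real fit UDF wrapping scikit-learn, Keras, or XGBoost.

```python
import pickle
from collections import defaultdict


def fit_mean(X, y, params=None):
    """Stand-in fit UDF (hypothetical): the 'model' is just the target mean."""
    return {"mean": sum(y) / len(y)}


def fit_by_group(rows, feature_cols, target_col, grouping_cols,
                 fit_udf, fit_udf_parameters=None):
    """Fit one model per group and return model-table rows (a sketch)."""
    # Bucket the source rows by the values of the grouping columns.
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in grouping_cols)
        groups[key].append(row)

    model_rows = []
    for key, group_rows in sorted(groups.items()):
        X = [[row[col] for col in feature_cols] for row in group_rows]
        y = [row[target_col] for row in group_rows]
        model = fit_udf(X, y, fit_udf_parameters)
        # One output row per group: grouping values plus the serialized model.
        model_row = dict(zip(grouping_cols, key))
        model_row["model"] = pickle.dumps(model)
        model_rows.append(model_row)
    return model_rows


sample = [
    {"region": "east", "x": 1.0, "y": 10.0},
    {"region": "east", "x": 2.0, "y": 20.0},
    {"region": "west", "x": 3.0, "y": 5.0},
]
model_table = fit_by_group(sample, ["x"], "y", ["region"], fit_mean)
```

Here `model_table` holds two rows, one for "east" and one for "west". A real fit UDF would return a fitted estimator instead of a dict, provided that object can be serialized.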