[SPARK-11106] Should ML Models contain single models or Pipelines? - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Critical
Resolution: Incomplete
Affects Version/s: None
Fix Version/s: None
Component/s: ML
Labels:
- bulk-closed

Description

This JIRA is for discussing whether ML Estimators should do feature processing.

Issue

Currently, almost all ML Estimators require strict input types. E.g., DecisionTreeClassifier requires that the label column be Double type and have metadata indicating the number of classes.

This requires users to know how to preprocess data.
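To make the burden concrete, here is a minimal pure-Python sketch of the situation described above. It does not use Spark: `StrictClassifier` and `index_labels` are hypothetical stand-ins, the "training" step is a trivial majority vote, and the sorted-order indexing is a simplification rather than a description of StringIndexer's behavior.

```python
from collections import Counter


class StrictClassifier:
    """Toy stand-in for an Estimator with strict input requirements."""

    def fit(self, labels, num_classes=None):
        # Mirror the strictness described above: labels must be floats,
        # and the number of classes must be supplied as metadata.
        if not all(isinstance(y, float) for y in labels):
            raise TypeError("label column must be of Double (float) type")
        if num_classes is None:
            raise ValueError("label metadata must state the number of classes")
        # "Training" here just records the majority class.
        return Counter(labels).most_common(1)[0][0]


def index_labels(raw_labels):
    """Manual preprocessing that the user has to write under the current design."""
    # Assign indices in sorted order (a simplification for this sketch).
    mapping = {v: float(i) for i, v in enumerate(sorted(set(raw_labels)))}
    return [mapping[v] for v in raw_labels], mapping


raw = ["spam", "ham", "spam"]
indexed, mapping = index_labels(raw)
majority = StrictClassifier().fit(indexed, num_classes=len(mapping))
```

Passing `raw` directly to `StrictClassifier().fit` raises a `TypeError`, which is the kind of failure a user who skips the preprocessing step would hit.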

Ideal workflow

A user should be able to pass any reasonable data to a Transformer or Estimator and have it "do the right thing."

E.g.:

- If DecisionTreeClassifier is given a String column for labels, it should know to index the Strings.
- See SPARK-10513 for a similar issue with OneHotEncoder.

Possible solutions

There are a few solutions I have thought of. Please comment with feedback or alternative ideas!

Leave as is

Pro: The current setup is good in that it forces the user to be very aware of what they are doing. Feature transformations will not happen silently.

Con: The user has to write boilerplate code for transformations. The API is not what some users would expect; e.g., coming from R, a user might expect some automatic transformations.

All Transformers can contain PipelineModels

We could allow all Transformers and Models to contain arbitrary PipelineModels. E.g., if a DecisionTreeClassifier were given a String label column, it might return a Model that wraps a simple fitted PipelineModel consisting of a StringIndexerModel followed by a DecisionTreeClassificationModel.

The API could present this to the user, or it could be hidden from the user. Ideally, it would be hidden from the beginner user, but accessible for experts.

The main problem is that we might have to break APIs. E.g., OneHotEncoder may need to do indexing if given a String input column. Indexing requires a pass over the data to learn the string-to-index mapping, so OneHotEncoder could no longer be a Transformer; it would have to be an Estimator.
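A rough sketch of this option, again in plain Python rather than Spark, might look as follows. All class names here are hypothetical illustrations, not proposed API: the estimator decides at fit time whether an indexing stage is needed and returns a wrapper model whose `stages` attribute stays available to expert users.

```python
class LabelIndexerModel:
    """Toy fitted indexer: maps string labels to float indices."""

    def __init__(self, mapping):
        self.mapping = mapping

    def transform(self, labels):
        return [self.mapping[v] for v in labels]


class MajorityModel:
    """Toy fitted classifier: always predicts the majority class index."""

    def __init__(self, prediction):
        self.prediction = prediction


class WrappedModel:
    """Hypothetical Model that hides a small fitted pipeline."""

    def __init__(self, stages):
        self.stages = stages  # exposed so experts can inspect each stage


class AutoIndexingClassifier:
    """Hypothetical Estimator that indexes string labels automatically."""

    def fit(self, labels):
        stages = []
        if any(isinstance(y, str) for y in labels):
            # Learn the mapping from the data; this is why fitting is required.
            mapping = {v: float(i) for i, v in enumerate(sorted(set(labels)))}
            indexer = LabelIndexerModel(mapping)
            labels = indexer.transform(labels)
            stages.append(indexer)
        majority = max(set(labels), key=labels.count)
        stages.append(MajorityModel(majority))
        return WrappedModel(stages)


model = AutoIndexingClassifier().fit(["yes", "no", "yes"])
stage_names = [type(s).__name__ for s in model.stages]
```

With string labels the returned model holds two stages (indexer, then classifier); with labels that are already numeric it holds only the classifier, so existing behavior for preprocessed data is unchanged in this sketch.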

All Estimators should use RFormula

The best option I have thought of is to make RFormula the primary method for automatic feature transformation. We could start by adding an RFormula Param to all Estimators, which could handle most of these feature transformation issues.

We could maintain old APIs:

- If a user sets the input column names, then those can be used in the traditional (no automatic transformation) way.
- If a user sets the RFormula Param, then it can be used instead. (This should probably take precedence over the old API.)
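The precedence rule in the two cases above could be sketched as below. `resolve_columns` is a hypothetical helper, not part of Spark, and its parser only understands the simplest `label ~ a + b` form; real R formulas support far more.

```python
def resolve_columns(formula=None, label_col=None, feature_cols=None):
    """Return (label, features), letting the formula take precedence."""
    if formula is not None:
        # Minimal parse of "label ~ a + b" (assumption: no interactions,
        # no "." shorthand, no term removal).
        lhs, rhs = formula.split("~", 1)
        features = [term.strip() for term in rhs.split("+") if term.strip()]
        return lhs.strip(), features
    if label_col is None or not feature_cols:
        raise ValueError("set either a formula or explicit column names")
    # Traditional path: use the column names as given, no transformation.
    return label_col, list(feature_cols)


# Both are set, so the formula wins over the explicit column names.
from_formula = resolve_columns(
    formula="clicked ~ country + hour", label_col="y", feature_cols=["x"]
)
# Only the old-style column names are set.
from_columns = resolve_columns(label_col="y", feature_cols=["x"])
```

Letting the formula win silently is one possible choice; an alternative would be to raise an error when both are set, which is stricter but avoids surprising the user.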

Issue Links

- contains: SPARK-12808 Formula based GLM in PySpark (Closed)
- is related to: SPARK-7126 For spark.ml Classifiers, automatically index labels if they are not yet indexed (Resolved)
- relates to: SPARK-15540 RFormula and R feature processing improvement umbrella (Resolved)

People

Assignee: Unassigned
Reporter: Joseph K. Bradley
Votes: 0
Watchers: 6

Dates

Created: 14/Oct/15 17:20
Updated: 21/May/19 04:36
Resolved: 21/May/19 04:36