[SPARK-26172] Unify String Params' case-insensitivity in ML - ASF JIRA

Currently, ML handles the case of string param values in three different ways:

1. case-insensitive, e.g. LogisticRegression;

2. case-insensitive, but the getter returns the lower-cased value (not the value passed to the setter), e.g. ALS, DecisionTreeClassifier;

3. case-sensitive (no support for case-insensitivity), e.g. NaiveBayes.

This inconsistency results in confusion in usage.
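
The three behaviours can be illustrated with a minimal, Spark-free sketch. The classes below (KeepsOriginal, StoresLowerCase, ExactMatchOnly) and the allowed-value set are hypothetical stand-ins invented for illustration; they are not the actual implementations of LogisticRegression, ALS or NaiveBayes.

```scala
import java.util.Locale

// Toy stand-ins (NOT Spark classes): each models one of the three behaviours.
object CaseHandlingSketch {
  // Hypothetical set of accepted values, stored in lower case.
  val allowed: Set[String] = Set("auto", "binomial", "multinomial")

  // Way 1: validate case-insensitively, keep the value exactly as passed.
  final class KeepsOriginal {
    private var value: String = "auto"
    def set(v: String): this.type = {
      require(allowed.contains(v.toLowerCase(Locale.ROOT)), s"unsupported value: $v")
      value = v
      this
    }
    def get: String = value
  }

  // Way 2: validate case-insensitively, but store the lower-cased value,
  // so the getter no longer returns what the caller passed in.
  final class StoresLowerCase {
    private var value: String = "auto"
    def set(v: String): this.type = {
      val lower = v.toLowerCase(Locale.ROOT)
      require(allowed.contains(lower), s"unsupported value: $v")
      value = lower
      this
    }
    def get: String = value
  }

  // Way 3: exact match only; any other casing is rejected.
  final class ExactMatchOnly {
    private var value: String = "auto"
    def set(v: String): this.type = {
      require(allowed.contains(v), s"unsupported value: $v")
      value = v
      this
    }
    def get: String = value
  }
}
```

With these toy classes, setting "Binomial" round-trips unchanged in the first case, comes back as "binomial" in the second, and throws an IllegalArgumentException in the third.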

I think we should choose the first way to support case-insensitivity of all non-columnName string params, including:

LogisticRegression: family
MultilayerPerceptronClassifier: solver
NaiveBayes: modelType
DecisionTreeClassifier: impurity
RandomForestClassifier: featureSubsetStrategy, impurity
GBTClassifier: featureSubsetStrategy, impurity, lossType
LinearRegression: solver, loss
GeneralizedLinearRegression: family, link, solver
DecisionTreeRegressor: impurity
RandomForestRegressor: featureSubsetStrategy, impurity
GBTRegressor: featureSubsetStrategy, impurity, lossType
KMeans: initMode
LDA: optimizer
PowerIterationClustering: initMode
ALS: coldStartStrategy, intermediateStorageLevel, finalStorageLevel
Bucketizer: handleInvalid
ChiSqSelector: selectorType
Imputer: strategy
QuantileDiscretizer: handleInvalid
RFormula: handleInvalid, stringIndexerOrderType
StringIndexer: handleInvalid, stringOrderType
VectorAssembler: handleInvalid
VectorIndexer: handleInvalid
VectorSizeHint: handleInvalid
OneHotEncoderEstimator: handleInvalid (this will be left alone until the breaking change)
BinaryClassificationEvaluator: metricName
MulticlassClassificationEvaluator: metricName
RegressionEvaluator: metricName
ClusteringEvaluator: metricName, distanceMeasure

To do this:

methods lowerCaseInArray and upperCaseInArray are added to ParamValidators to validate values case-insensitively;
methods $$(param: Param[String]) and %%(param: Param[String]) are added to trait Params to conveniently get the lower-cased/upper-cased param value. This minimizes the modifications to existing code, since in many cases we only need to change $(param) to $$(param);
in SharedParamsCodeGen, handleInvalid and distanceMeasure are updated to use lowerCaseInArray.
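
A rough, self-contained sketch of how these helpers could fit together is shown below. It does not use Spark: StringParam, ParamsSketch and ToyEstimator are hypothetical simplifications of Param[String], trait Params and an estimator, and the exact signatures of lowerCaseInArray and upperCaseInArray are assumed here rather than taken from the patch.

```scala
import java.util.Locale
import scala.collection.mutable

object ParamSketch {
  // Assumed shape: a validator that accepts a value if its lower-cased form is allowed.
  def lowerCaseInArray(allowed: Array[String]): String => Boolean =
    (value: String) => allowed.contains(value.toLowerCase(Locale.ROOT))

  // Assumed shape: same idea for values whose canonical form is upper case.
  def upperCaseInArray(allowed: Array[String]): String => Boolean =
    (value: String) => allowed.contains(value.toUpperCase(Locale.ROOT))

  // Hypothetical, much simplified stand-in for Param[String].
  final case class StringParam(name: String, isValid: String => Boolean)

  // Hypothetical, much simplified stand-in for trait Params.
  trait ParamsSketch {
    private val values = mutable.Map.empty[String, String]

    protected def set(param: StringParam, value: String): this.type = {
      require(param.isValid(value), s"invalid value '$value' for ${param.name}")
      values(param.name) = value // the value is stored exactly as passed
      this
    }

    // Returns the stored value unchanged.
    def $(param: StringParam): String = values(param.name)
    // Returns the stored value lower-cased, for internal comparisons.
    def $$(param: StringParam): String = $(param).toLowerCase(Locale.ROOT)
    // Returns the stored value upper-cased.
    def %%(param: StringParam): String = $(param).toUpperCase(Locale.ROOT)
  }

  final class ToyEstimator extends ParamsSketch {
    val family: StringParam =
      StringParam("family", lowerCaseInArray(Array("auto", "binomial", "multinomial")))

    def setFamily(v: String): this.type = set(family, v)
    // The getter returns what the user passed to the setter.
    def getFamily: String = $(family)
    // Internal logic switches from $(family) to $$(family).
    def isBinomial: Boolean = $$(family) == "binomial"
  }
}
```

In this sketch the getter keeps returning the value the caller supplied (the first of the three behaviours), while internal code reads the normalized form through $$, which is the small $(param) to $$(param) change described above.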