Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Won't Fix
- 3.0.0
- None
- None
Description
Currently, string params in ML handle case-insensitivity in three different ways:
1. case-insensitive, e.g. LogisticRegression;
2. case-insensitive, but the getter returns the lower-case value (not the value passed to the setter), e.g. ALS, DecisionTreeClassifier;
3. case-sensitive, e.g. NaiveBayes.
This inconsistency makes the API confusing to use.
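To make the difference concrete, here is a minimal sketch of the three behaviors. It does not use Spark: the classes `KeepsOriginal`, `StoresLowerCase` and `ExactOnly` and the allowed values are hypothetical stand-ins, chosen only to mirror the three cases above.

```scala
import java.util.Locale

// Hypothetical stand-ins for the three behaviours; none of these are Spark classes.
object CaseHandlingSketch {
  // Assumed set of supported values, for illustration only.
  private val allowed = Set("auto", "binomial", "multinomial")

  // Way 1: any casing is accepted and the getter returns the value as passed.
  final class KeepsOriginal {
    private var value = "auto"
    def set(v: String): this.type = {
      require(allowed.contains(v.toLowerCase(Locale.ROOT)), "unsupported value: " + v)
      value = v
      this
    }
    def get: String = value
  }

  // Way 2: any casing is accepted, but the lower-cased form is stored and returned.
  final class StoresLowerCase {
    private var value = "auto"
    def set(v: String): this.type = {
      require(allowed.contains(v.toLowerCase(Locale.ROOT)), "unsupported value: " + v)
      value = v.toLowerCase(Locale.ROOT)
      this
    }
    def get: String = value
  }

  // Way 3: only the exact spelling is accepted.
  final class ExactOnly {
    private var value = "auto"
    def set(v: String): this.type = {
      require(allowed.contains(v), "unsupported value: " + v)
      value = v
      this
    }
    def get: String = value
  }
}
```

With these toy classes, setting "Binomial" round-trips unchanged in the first case, comes back lower-cased in the second, and is rejected with an IllegalArgumentException in the third.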
I think we should adopt the first approach and support case-insensitivity for all string params that are not column names, including:
- LogisticRegression: family
- MultilayerPerceptronClassifier: solver
- NaiveBayes: modelType
- DecisionTreeClassifier: impurity
- RandomForestClassifier: featureSubsetStrategy, impurity
- GBTClassifier: featureSubsetStrategy, impurity, lossType
- LinearRegression: solver, loss
- GeneralizedLinearRegression: family, link, solver
- DecisionTreeRegressor: impurity
- RandomForestRegressor: featureSubsetStrategy, impurity
- GBTRegressor: featureSubsetStrategy, impurity, lossType
- KMeans: initMode
- LDA: optimizer
- PowerIterationClustering: initMode
- ALS: coldStartStrategy, intermediateStorageLevel, finalStorageLevel
- Bucketizer: handleInvalid
- ChiSqSelector: selectorType
- Imputer: strategy
- QuantileDiscretizer: handleInvalid
- RFormula: handleInvalid, stringIndexerOrderType
- StringIndexer: handleInvalid, stringOrderType
- VectorAssembler: handleInvalid
- VectorIndexer: handleInvalid
- VectorSizeHint: handleInvalid
- OneHotEncoderEstimator: handleInvalid (this will be left alone until the breaking change)
- BinaryClassificationEvaluator: metricName
- MulticlassClassificationEvaluator: metricName
- RegressionEvaluator: metricName
- ClusteringEvaluator: metricName, distanceMeasure
To do this:
- methods lowerCaseInArray and upperCaseInArray are added to ParamValidators to validate values case-insensitively;
- methods $$(param: Param[String]) and %%(param: Param[String]) are added to trait Params to conveniently return the lower-case or upper-case param value; this minimizes changes to existing code, since in many cases we only need to change $(param) to $$(param);
- in SharedParamsCodeGen, handleInvalid and distanceMeasure are updated to use lowerCaseInArray.
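A rough, Spark-free sketch of how these pieces could fit together follows. Only the method names lowerCaseInArray, upperCaseInArray, $, $$ and %% come from the proposal; the signatures, the `StringParam` and `ParamsSketch` types, and the `ToyEstimator` class are assumptions made for illustration, and the accessors are left public here only so they can be called directly.

```scala
import java.util.Locale
import scala.collection.mutable

object CaseInsensitiveParamsSketch {
  // Validator factories named after the proposed ParamValidators methods.
  // The allowed array is assumed to already be in lower (resp. upper) case.
  def lowerCaseInArray(allowed: Array[String]): String => Boolean =
    (value: String) => allowed.contains(value.toLowerCase(Locale.ROOT))

  def upperCaseInArray(allowed: Array[String]): String => Boolean =
    (value: String) => allowed.contains(value.toUpperCase(Locale.ROOT))

  // Hypothetical stand-in for Param[String]: a name plus a validator.
  final case class StringParam(name: String, isValid: String => Boolean)

  // Hypothetical stand-in for the Params trait.
  trait ParamsSketch {
    private val values = mutable.Map.empty[String, String]

    protected def set(param: StringParam, value: String): this.type = {
      require(param.isValid(value), "invalid value for " + param.name + ": " + value)
      values(param.name) = value // stored exactly as passed
      this
    }

    // Raw value, as the caller supplied it.
    def $(param: StringParam): String = values(param.name)
    // Lower-cased view, for internal matching.
    def $$(param: StringParam): String = $(param).toLowerCase(Locale.ROOT)
    // Upper-cased view, for internal matching.
    def %%(param: StringParam): String = $(param).toUpperCase(Locale.ROOT)
  }

  // Toy estimator with one lower-case and one upper-case param.
  final class ToyEstimator extends ParamsSketch {
    val family: StringParam =
      StringParam("family", lowerCaseInArray(Array("auto", "binomial", "multinomial")))
    val storageLevel: StringParam =
      StringParam("storageLevel", upperCaseInArray(Array("MEMORY_ONLY", "DISK_ONLY")))

    def setFamily(v: String): this.type = set(family, v)
    def getFamily: String = $(family)
    def setStorageLevel(v: String): this.type = set(storageLevel, v)
    def getStorageLevel: String = $(storageLevel)
  }
}
```

In this sketch the getter keeps returning what the user passed (the first approach), while internal code that needs to match on the value switches from $(param) to $$(param) or %%(param).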