  1. Spark
  2. SPARK-26172

Unify String Params' case-insensitivity in ML


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

    Description

      Currently, ML handles the case of string param values in three different ways:

      1. Case-insensitive, with the getter returning the value exactly as passed to the setter, e.g. LogisticRegression;

      2. Case-insensitive, but with the getter returning the lower-cased value (not the value passed to the setter), e.g. ALS, DecisionTreeClassifier;

      3. Case-sensitive, e.g. NaiveBayes.

      This inconsistency is confusing for users.
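
      The three behaviours can be illustrated with a minimal, Spark-free sketch. The classes Preserving, Normalizing and Strict and the allowed-value set are hypothetical stand-ins chosen for illustration; they are not the actual Spark implementations of the estimators named above.

```scala
import java.util.Locale
import scala.util.Try

// Hypothetical stand-ins for the three behaviours; none of these classes exist in Spark.
object CaseHandlingSketch {
  private val allowed = Set("auto", "binomial", "multinomial")
  private def lower(s: String): String = s.toLowerCase(Locale.ROOT)

  // Way 1: case-insensitive validation; the getter returns the value as passed.
  final class Preserving {
    private var value = "auto"
    def set(v: String): this.type = {
      require(allowed.contains(lower(v)), "unsupported value: " + v)
      value = v
      this
    }
    def get: String = value
  }

  // Way 2: case-insensitive validation; the value is lower-cased when stored.
  final class Normalizing {
    private var value = "auto"
    def set(v: String): this.type = {
      require(allowed.contains(lower(v)), "unsupported value: " + v)
      value = lower(v)
      this
    }
    def get: String = value
  }

  // Way 3: case-sensitive validation; only exact matches are accepted.
  final class Strict {
    private var value = "auto"
    def set(v: String): this.type = {
      require(allowed.contains(v), "unsupported value: " + v)
      value = v
      this
    }
    def get: String = value
  }

  def main(args: Array[String]): Unit = {
    println(new Preserving().set("Binomial").get)        // Binomial
    println(new Normalizing().set("Binomial").get)       // binomial
    println(Try(new Strict().set("Binomial")).isFailure) // true
  }
}
```

      The same input, "Binomial", is accepted and echoed back, accepted and silently changed, or rejected, depending on which estimator receives it.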

      I think we should adopt the first way and support case-insensitivity for all string params that are not column names, including:

      • LogisticRegression: family
      • MultilayerPerceptronClassifier: solver
      • NaiveBayes: modelType
      • DecisionTreeClassifier: impurity
      • RandomForestClassifier: featureSubsetStrategy, impurity
      • GBTClassifier: featureSubsetStrategy, impurity, lossType
      • LinearRegression: solver, loss
      • GeneralizedLinearRegression: family, link, solver
      • DecisionTreeRegressor: impurity
      • RandomForestRegressor: featureSubsetStrategy, impurity
      • GBTRegressor: featureSubsetStrategy, impurity, lossType
      • KMeans: initMode
      • LDA: optimizer
      • PowerIterationClustering: initMode
      • ALS: coldStartStrategy, intermediateStorageLevel, finalStorageLevel
      • Bucketizer: handleInvalid
      • ChiSqSelector: selectorType
      • Imputer: strategy
      • QuantileDiscretizer: handleInvalid
      • RFormula: handleInvalid, stringIndexerOrderType
      • StringIndexer: handleInvalid, stringOrderType
      • VectorAssembler: handleInvalid
      • VectorIndexer: handleInvalid
      • VectorSizeHint: handleInvalid
      • OneHotEncoderEstimator: handleInvalid (this will be left alone until the breaking change)
      • BinaryClassificationEvaluator: metricName
      • MulticlassClassificationEvaluator: metricName
      • RegressionEvaluator: metricName
      • ClusteringEvaluator: metricName, distanceMeasure

       

       

       

      To do this:

      • methods lowerCaseInArray and upperCaseInArray are added to ParamValidators to validate values case-insensitively;
      • methods $$(param: Param[String]) and %%(param: Param[String]) are added to trait Params to conveniently get the lower-cased/upper-cased param value; this minimizes the changes to existing code, since in many cases we only need to change $(param) to $$(param);
      • in SharedParamsCodeGen, handleInvalid and distanceMeasure are updated to use lowerCaseInArray.
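
      A rough sketch of how these helpers might fit together is given below. It does not depend on Spark: StringParam and ParamsSketch are hypothetical, simplified stand-ins for Param[String] and trait Params, and the signatures of lowerCaseInArray, upperCaseInArray, $$ and %% are assumptions based only on the description above, not the code of the actual patch.

```scala
import java.util.Locale
import scala.collection.mutable

object ParamHelpersSketch {
  // Hypothetical, simplified stand-in for Spark's Param[String].
  final case class StringParam(name: String, isValid: String => Boolean)

  // Assumed shape: accept a value if its lower-cased form is in `allowed`.
  def lowerCaseInArray(allowed: Array[String]): String => Boolean =
    (value: String) => allowed.contains(value.toLowerCase(Locale.ROOT))

  // Assumed shape: same idea for values whose canonical form is upper case.
  def upperCaseInArray(allowed: Array[String]): String => Boolean =
    (value: String) => allowed.contains(value.toUpperCase(Locale.ROOT))

  // Hypothetical, simplified stand-in for trait Params.
  trait ParamsSketch {
    private val values = mutable.Map.empty[String, String]

    def set(param: StringParam, value: String): this.type = {
      require(param.isValid(value), "invalid value " + value + " for " + param.name)
      values(param.name) = value // stored exactly as passed
      this
    }

    // Plays the role of the existing $(param): the value as set (throws if unset).
    def $(param: StringParam): String = values(param.name)
    // Assumed behaviour of the proposed helpers: normalise only on read.
    def $$(param: StringParam): String = $(param).toLowerCase(Locale.ROOT)
    def %%(param: StringParam): String = $(param).toUpperCase(Locale.ROOT)
  }

  final class Estimator extends ParamsSketch {
    val handleInvalid: StringParam =
      StringParam("handleInvalid", lowerCaseInArray(Array("skip", "error", "keep")))
    val storageLevel: StringParam =
      StringParam("storageLevel", upperCaseInArray(Array("MEMORY_ONLY", "MEMORY_AND_DISK")))
  }

  def main(args: Array[String]): Unit = {
    val e = new Estimator
    e.set(e.handleInvalid, "Keep").set(e.storageLevel, "memory_and_disk")
    println(e.$(e.handleInvalid))  // Keep
    println(e.$$(e.handleInvalid)) // keep
    println(e.%%(e.storageLevel))  // MEMORY_AND_DISK
  }
}
```

      Under these assumptions the getter keeps returning the user's original spelling, while internal code that branches on the value reads it through $$ or %% and so always sees a normalised form.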

       


          People

            Assignee: Unassigned
            Reporter: Ruifeng Zheng (podongfeng)
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved:
