SPARK-32973

FeatureHasher does not check categoricalCols in inputCols

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 2.3.0, 2.4.0, 3.0.0, 3.1.0
    • Fix Version/s: 3.1.0
    • Component/s: Documentation, ML
    • Labels:
      None

      Description

      The documentation for categoricalCols says:

      Numeric columns to treat as categorical features. By default only string and boolean columns are treated as categorical, so this param can be used to explicitly specify the numerical columns to treat as categorical. Note, the relevant columns must also be set in inputCols. 

       

      However, the check that every column in categoricalCols is also in inputCols was never implemented.

      For example, in 2.4.7 and on current master (3.1.0):

      scala> import org.apache.spark.ml.feature._
      import org.apache.spark.ml.feature._
      scala> import org.apache.spark.ml.linalg.{Vector, Vectors}
      import org.apache.spark.ml.linalg.{Vector, Vectors}
      
      scala> val df = Seq((2.0, 1, "foo"),(3.0, 2, "bar")).toDF("real", "int", "string")
      df: org.apache.spark.sql.DataFrame = [real: double, int: int ... 1 more field]
      scala> val n = 100
      n: Int = 100
      scala> val hasher = new FeatureHasher().setInputCols("int", "string").setCategoricalCols(Array("real")).setOutputCol("features").setNumFeatures(n) 
      hasher: org.apache.spark.ml.feature.FeatureHasher = featureHasher_fbe05968b33f
      scala> hasher.transform(df).show
      +----+---+------+--------------------+
      |real|int|string|            features|
      +----+---+------+--------------------+
      | 2.0|  1|   foo|(100,[2,39],[1.0,...|
      | 3.0|  2|   bar|(100,[2,42],[2.0,...|
      +----+---+------+--------------------+
      
      

       

      Here the categorical column "real" is not in inputCols ("int", "string"), yet transform runs without any error.

       

      I think there are two options:

      1. Remove the comment "Note, the relevant columns must also be set in inputCols.", since this requirement seems unnecessary.

      2. Add a check to make sure all categoricalCols are in inputCols.
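
      As an illustration of option 2, the check could look roughly like the sketch below. It is plain Scala with no Spark dependency; the object and method names (CategoricalColsCheck, missingCategoricalCols, validate) are hypothetical and are not part of FeatureHasher, and this is not necessarily the change that resolved this issue.

```scala
// Hypothetical helper, not part of Spark: a sketch of the subset check
// described in option 2.
object CategoricalColsCheck {

  // Returns the categorical columns that do not appear in inputCols.
  def missingCategoricalCols(
      inputCols: Seq[String],
      categoricalCols: Seq[String]): Seq[String] = {
    val inputSet = inputCols.toSet
    categoricalCols.filterNot(inputSet.contains)
  }

  // Throws IllegalArgumentException if any categorical column is missing
  // from inputCols; does nothing otherwise.
  def validate(inputCols: Seq[String], categoricalCols: Seq[String]): Unit = {
    val missing = missingCategoricalCols(inputCols, categoricalCols)
    require(
      missing.isEmpty,
      s"categoricalCols not found in inputCols: ${missing.mkString(", ")}")
  }
}
```

      With the columns from the example above, validate(Seq("int", "string"), Seq("real")) would fail because "real" is not an input column. Whether a hard failure or only a logged warning is the better behaviour is a separate design decision.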

       

       

            People

            • Assignee: podongfeng zhengruifeng
            • Reporter: podongfeng zhengruifeng