SPARK-20604

Allow Imputer to handle all numeric types

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 3.0.0
    • Component/s: ML
    • Labels:
      None

      Description

      Imputer currently requires each input column to be of type Double or Float, but its logic should work on any numeric data type. Many practical problems involve integer data, and manually casting such columns to Double before calling Imputer quickly becomes tedious. The transformer could be extended to handle all numeric types.

      The example below shows Imputer failing on integer data.

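          import org.apache.spark.ml.feature.Imputer
          import org.apache.spark.sql.functions.col
          import org.apache.spark.sql.types.IntegerType

          val df = spark.createDataFrame(Seq(
            (0, 1.0, 1.0, 1.0),
            (1, 11.0, 11.0, 11.0),
            (2, 1.5, 1.5, 1.5),
            (3, Double.NaN, 4.5, 1.5)
          )).toDF("id", "value1", "expected_mean_value1", "expected_median_value1")
          val imputer = new Imputer()
            .setInputCols(Array("value1"))
            .setOutputCols(Array("out1"))
          // Casting value1 to IntegerType makes fit() fail the input type check.
          imputer.fit(df.withColumn("value1", col("value1").cast(IntegerType)))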
      
      java.lang.IllegalArgumentException: requirement failed: Column value1 must be of type equal to one of the following types: [DoubleType, FloatType] but was actually of type IntegerType.
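Until Imputer accepts all numeric types, the manual cast mentioned in the description can be wrapped in a small helper. The following is a minimal sketch, not Spark's own fix: `ImputerCastWorkaround`, `castToDouble`, and `imputeIntegerColumn` are hypothetical names introduced here, and the sample rows are made up for illustration. It assumes a local Spark session and relies on Imputer treating null input values as missing.

```scala
import org.apache.spark.ml.feature.Imputer
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, NumericType}

object ImputerCastWorkaround {

  // Hypothetical helper (not part of Spark): cast the named numeric
  // columns to DoubleType so they pass Imputer's input type check.
  def castToDouble(df: DataFrame, names: Seq[String]): DataFrame =
    names.foldLeft(df) { (acc, name) =>
      acc.schema(name).dataType match {
        case DoubleType     => acc // already the accepted type
        case _: NumericType => acc.withColumn(name, col(name).cast(DoubleType))
        case other =>
          throw new IllegalArgumentException(s"Column $name is not numeric: $other")
      }
    }

  // Builds an integer column with one missing (null) entry, casts it,
  // and imputes it with the default strategy (mean).
  def imputeIntegerColumn(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Illustrative data only: value1 is an integer column with a null.
    val rows: Seq[(Int, Option[Int])] = Seq((0, Some(1)), (1, Some(11)), (2, None))
    val df = rows.toDF("id", "value1")

    val imputer = new Imputer()
      .setInputCols(Array("value1"))
      .setOutputCols(Array("out1"))

    val prepared = castToDouble(df, Seq("value1"))
    imputer.fit(prepared).transform(prepared)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("imputer-cast-workaround")
      .getOrCreate()
    try imputeIntegerColumn(spark).show()
    finally spark.stop()
  }
}
```

The helper leaves Double columns untouched and rejects non-numeric columns early, so a misnamed column fails with a clear message rather than inside `fit`. Note that the cast changes the column to Double in the output as well, which is part of what makes the manual approach tedious.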
      
      

            People

            • Assignee: Wayne Zhang (actuaryzhang)
            • Reporter: Wayne Zhang (actuaryzhang)
            • Shepherd: Yanbo Liang
            • Votes: 0
            • Watchers: 2
