Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Resolved
    • Affects Version/s: 2.2.1
    • Fix Version/s: None
    • Component/s: ML
    • Labels:
      None

      Description

          val surrogates = $(inputCols).map { inputCol =>
            val ic = col(inputCol)
            val filtered = dataset.select(ic.cast(DoubleType))
              .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN)
            if(filtered.take(1).length == 0) {
              throw new SparkException(s"surrogate cannot be computed. " +
                s"All the values in $inputCol are Null, Nan or missingValue(${$(missingValue)})")
            }
            val surrogate = $(strategy) match {
              case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first()
              case Imputer.median => filtered.stat.approxQuantile(inputCol, Array(0.5), 0.001).head
            }
            surrogate
          }
      

      Current impl of Imputer process one column after after another. In this place, we should parallelize the processing in a more efficient way.

        Attachments

          Activity

            People

            • Assignee:
              podongfeng zhengruifeng
              Reporter:
              podongfeng zhengruifeng
              Shepherd:
              Yanbo Liang
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: