Spark / SPARK-30939

StringIndexer setOutputCols does not set output cols


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: ML
    • Labels: None

      Description

      (Credit to Brooke Wenig for finding it). Quoting:

      ".. The python code works completely fine, but the scala code is outputting

      strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output
      

      for the output of the string indexer, instead of using the column names specified in here:

      val stringIndexer = new StringIndexer()
        .setInputCols(categoricalCols)
        .setOutputCols(indexOutputCols)
        .setHandleInvalid("skip")
      

      I was expecting the resulting column names to be

      indexOutputCols: Array[String] = Array(host_is_superhostIndex, cancellation_policyIndex, instant_bookableIndex, neighbourhood_cleansedIndex, property_typeIndex, room_typeIndex, bed_typeIndex)
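
A schema-only check along the following lines may help confirm the report without loading any data. This is a minimal sketch, not the reporter's code: the object name `StringIndexerSchemaCheck`, the helper `resolvedFieldNames`, and the two-column schema are assumptions made for illustration. It relies on `transformSchema`, which is the code path that calls `validateAndTransformField`.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical helper object, introduced only for this illustration.
object StringIndexerSchemaCheck {

  // Returns the field names that StringIndexer reports for its output schema.
  def resolvedFieldNames(): Array[String] = {
    // Two of the column names from the report; any string columns would do.
    val inputCols = Array("room_type", "bed_type")
    val outputCols = inputCols.map(_ + "Index")

    // Build the input schema by hand so that no SparkSession is needed.
    val schema = StructType(inputCols.map(name => StructField(name, StringType)))

    val indexer = new StringIndexer()
      .setInputCols(inputCols)
      .setOutputCols(outputCols)
      .setHandleInvalid("skip")

    // transformSchema appends one output field per input column.
    indexer.transformSchema(schema).fieldNames
  }

  def main(args: Array[String]): Unit =
    println(resolvedFieldNames().mkString(", "))
}
```

On an affected build, the appended fields would be expected to carry the repeated default name described above; with the fix applied, they should be `room_typeIndex` and `bed_typeIndex`.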
      

      Indeed, I'm pretty sure this is the bug:

        private def validateAndTransformField(
            schema: StructType,
            inputColName: String,
            outputColName: String): StructField = {
          val inputDataType = schema(inputColName).dataType
          require(inputDataType == StringType || inputDataType.isInstanceOf[NumericType],
            s"The input column $inputColName must be either string type or numeric type, " +
              s"but got $inputDataType.")
          require(schema.fields.forall(_.name != outputColName),
            s"Output column $outputColName already exists.")
          NominalAttribute.defaultAttr.withName($(outputCol)).toStructField()
        }
      

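      The last line ignores the outputColName argument and instead reads $(outputCol), the single-column output parameter. That parameter is not set when setOutputCols is used, so it falls back to its default of <uid>__output, and every output field gets the same name. This matches the repeated strIdx_8278ae6d55b3__output seen above.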
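
The change this implies can be sketched as below. It is a standalone illustration under stated assumptions, not the committed patch: the wrapper object `ValidateFieldSketch` is hypothetical, and the method is made public here so it can be called outside `StringIndexerBase`. The only substantive difference from the quoted code is the last line, which uses `outputColName` instead of `$(outputCol)`.

```scala
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.types.{NumericType, StringType, StructField, StructType}

// Hypothetical wrapper object, introduced only for this illustration.
object ValidateFieldSketch {

  def validateAndTransformField(
      schema: StructType,
      inputColName: String,
      outputColName: String): StructField = {
    val inputDataType = schema(inputColName).dataType
    require(inputDataType == StringType || inputDataType.isInstanceOf[NumericType],
      s"The input column $inputColName must be either string type or numeric type, " +
        s"but got $inputDataType.")
    require(schema.fields.forall(_.name != outputColName),
      s"Output column $outputColName already exists.")
    // Name the field after the per-column argument rather than the
    // single-column outputCol parameter.
    NominalAttribute.defaultAttr.withName(outputColName).toStructField()
  }
}
```

With this version, each call returns a field named after its own `outputColName`, so the multi-column path no longer collapses to one default name.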

              People

              • Assignee: Sean R. Owen (srowen)
              • Reporter: Sean R. Owen (srowen)
