Description
(Credit to Brooke Wenig for finding it). Quoting:
".. The Python code works completely fine, but the Scala code is outputting
strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output
for the output of the string indexer, instead of using the column names specified in here:
val stringIndexer = new StringIndexer()
  .setInputCols(categoricalCols)
  .setOutputCols(indexOutputCols)
  .setHandleInvalid("skip")
I was expecting the resulting column names to be
indexOutputCols: Array[String] = Array(host_is_superhostIndex, cancellation_policyIndex, instant_bookableIndex, neighbourhood_cleansedIndex, property_typeIndex, room_typeIndex, bed_typeIndex)
Indeed, I'm pretty sure this is the bug, in validateAndTransformField:
private def validateAndTransformField(
    schema: StructType,
    inputColName: String,
    outputColName: String): StructField = {
  val inputDataType = schema(inputColName).dataType
  require(inputDataType == StringType || inputDataType.isInstanceOf[NumericType],
    s"The input column $inputColName must be either string type or numeric type, " +
      s"but got $inputDataType.")
  require(schema.fields.forall(_.name != outputColName),
    s"Output column $outputColName already exists.")
  NominalAttribute.defaultAttr.withName($(outputCol)).toStructField()
}
The last line calls withName($(outputCol)), which reads the default single-output-column parameter, instead of using the outputColName argument, so every generated field receives the same auto-generated name.
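The failure mode can be illustrated outside of Spark. The sketch below (not Spark's actual classes; the names are hypothetical) mimics a method that ignores its per-column argument in favor of a single default parameter, alongside the corrected version that uses the argument:

```scala
object StringIndexerBugSketch {
  // Stand-in for the auto-generated default of the single-column
  // outputCol param (the value seen repeated in the bug report).
  val defaultOutputCol = "strIdx_8278ae6d55b3__output"

  // Buggy shape: ignores its outputColName argument, mirroring
  // withName($(outputCol)) in validateAndTransformField.
  def buggyFieldName(outputColName: String): String = defaultOutputCol

  // Fixed shape: uses the per-column name that was passed in,
  // mirroring withName(outputColName).
  def fixedFieldName(outputColName: String): String = outputColName

  def main(args: Array[String]): Unit = {
    val outputCols = Array("host_is_superhostIndex", "room_typeIndex")
    // Every column collapses to the same default name.
    println(outputCols.map(buggyFieldName).mkString(", "))
    // Each column keeps its requested name.
    println(outputCols.map(fixedFieldName).mkString(", "))
  }
}
```

Under this reading, the fix in Spark would be the one-token change of replacing $(outputCol) with outputColName on the method's last line.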
Issue Links
- relates to SPARK-11215: Add multiple columns support to StringIndexer (Resolved)