[SPARK-49615] Feature transformers are case sensitive when unintended - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.4.3
Fix Version/s: 4.0.0
Component/s: ML, MLlib, Spark Core
Labels:
- pull-request-available

Description

Hi team,

https://spark.apache.org/docs/latest/ml-features

The feature transformers are case sensitive even though the configuration value returned by

spark.conf.get("spark.sql.caseSensitive")

is false. Users of these transformers are therefore forced to match the exact case of the column name as it appears in the DataFrame:

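import scala.collection.JavaConverters._
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Assumes an active SparkSession named `spark`.
val data = List(
  Row("the movie was great", "positive", 10, "greatest of all time"),
  Row("the movie was average", "negative", 11, "just average things, average storyline"),
  Row("movie was fun", "positive", 2, "superb screen play"))
val schema = new StructType()
  .add("comments", StringType, true)
  .add("reviews", StringType, true)
  .add("counts", IntegerType, true)
  .add("Additional_COMMENTS", StringType, true)
val df = spark.createDataFrame(data.asJava, schema)
// The input column is given in lower case, unlike "Additional_COMMENTS" in the schema.
val si = new StringIndexer().setInputCol("additional_comments").setOutputCol("si_additional_comments")
si.fit(df).transform(df).show()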

The above code fails with

Exception in thread "main" org.apache.spark.SparkException: Input column additional_comments does not exist.
    at org.apache.spark.ml.feature.StringIndexerBase.$anonfun$validateAndTransformSchema$2(StringIndexer.scala:128)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
    at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
    at org.apache.spark.ml.feature.StringIndexerBase.validateAndTransformSchema(StringIndexer.scala:123)
    at org.apache.spark.ml.feature.StringIndexerBase.validateAndTransformSchema$(StringIndexer.scala:115)

This means that the column "additional_comments" must be provided in exactly the same case as in the DataFrame.

I think that when spark.sql.caseSensitive is set to false, we should be able to refer to the column name in any case.

Can someone please help fix this bug for all transformers?
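To make the requested behavior concrete, here is a minimal sketch of case-insensitive column name resolution. It is an illustration only, not the change made in the linked pull requests: the object ColumnResolver and the method resolveColumn are hypothetical names, and the choice to return None for an ambiguous match is an assumption of this sketch.

```scala
// Hypothetical helper, not part of Spark: find the schema column that a
// user-supplied name refers to, honoring a case-sensitivity flag.
object ColumnResolver {
  def resolveColumn(
      columns: Seq[String],
      name: String,
      caseSensitive: Boolean): Option[String] = {
    if (caseSensitive) {
      // Exact match only.
      columns.find(_ == name)
    } else {
      // Collect every column that matches when case is ignored.
      val matches = columns.filter(_.equalsIgnoreCase(name))
      // Assumption of this sketch: only a unique match is accepted;
      // zero or several matches yield None.
      if (matches.size == 1) matches.headOption else None
    }
  }
}
```

With the DataFrame from the description, something like ColumnResolver.resolveColumn(df.columns.toSeq, "additional_comments", caseSensitive = false) would yield Some("Additional_COMMENTS"), which could then be passed to setInputCol as a workaround on affected versions.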

Issue Links

links to

GitHub Pull Request #48398

GitHub Pull Request #48747

People

Assignee: Weichen Xu

Reporter: Chhavi Bansal

Votes: 0

Watchers: 3

Dates

Created: 12/Sep/24 12:30

Updated: 04/Nov/24 21:41

Resolved: 04/Nov/24 21:41