Spark / SPARK-49615

Feature transformers are case sensitive when unintended


Details

    Description

      Hi team,

      https://spark.apache.org/docs/latest/ml-features

      The feature transformers are case sensitive even though the configuration

      spark.conf.get("spark.sql.caseSensitive")

      is set to false. Users of these transformers are forced to match the exact case of the column name in the DataFrame.

       

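      import scala.collection.JavaConverters._
      import org.apache.spark.ml.feature.StringIndexer
      import org.apache.spark.sql.Row
      import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

      val data = List(
        Row("the movie was great", "positive", 10, "greatest of all time"),
        Row("the movie was average", "negative", 11, "just average things, average storyline"),
        Row("movie was fun", "positive", 2, "superb screen play"))
      val schema = new StructType()
        .add("comments", StringType, true)
        .add("reviews", StringType, true)
        .add("counts", IntegerType, true)
        .add("Additional_COMMENTS", StringType, true) // mixed-case column name
      val df = spark.createDataFrame(data.asJava, schema)
      // The input column is given in lower case, which does not match the case used in the schema.
      val si = new StringIndexer()
        .setInputCol("additional_comments")
        .setOutputCol("si_additional_comments")
      si.fit(df).transform(df).show()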

      The above code fails with:

      Exception in thread "main" org.apache.spark.SparkException: Input column additional_comments does not exist.
          at org.apache.spark.ml.feature.StringIndexerBase.$anonfun$validateAndTransformSchema$2(StringIndexer.scala:128)
          at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
          at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
          at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
          at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
          at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
          at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
          at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
          at org.apache.spark.ml.feature.StringIndexerBase.validateAndTransformSchema(StringIndexer.scala:123)
          at org.apache.spark.ml.feature.StringIndexerBase.validateAndTransformSchema$(StringIndexer.scala:115) 

      This means that the input column must be given as "Additional_COMMENTS", in exactly the same case as in the DataFrame schema; "additional_comments" is rejected.

       

      I think that when spark.sql.caseSensitive is set to false, we should be able to refer to a column name in any case.

       

      Can someone please help fix this bug for all transformers?
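
      As a possible stopgap on the user side, the requested name can be matched against the DataFrame's columns without regard to case before it is passed to a transformer. The sketch below is only an illustration under that assumption: resolveColumn is a hypothetical helper, not part of Spark, and it does not read spark.sql.caseSensitive itself.

```scala
// Hypothetical helper (not part of Spark): find the actual column name that
// matches `requested`, ignoring case. Returns None when nothing matches or
// when several columns differ only by case (ambiguous).
def resolveColumn(columns: Seq[String], requested: String): Option[String] = {
  val matches = columns.filter(_.equalsIgnoreCase(requested))
  if (matches.size == 1) Some(matches.head) else None
}

// Column names taken from the schema in the example above.
val columns = Seq("comments", "reviews", "counts", "Additional_COMMENTS")
val resolved = resolveColumn(columns, "additional_comments")
println(resolved) // Some(Additional_COMMENTS)
```

      With the DataFrame from the example, the resolved name could then be passed on, e.g. setInputCol(resolveColumn(df.columns, "additional_comments").getOrElse("additional_comments")), so that an unmatched name still reaches the transformer and produces the usual error.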

          People

            weichenxu123 Weichen Xu
            chhavibansal Chhavi Bansal