Spark / SPARK-38109

pyspark DataFrame.replace() is sensitive to column name case in pyspark 3.2 but not in 3.1


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.2.0, 3.2.1
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      The `subset` argument for `DataFrame.replace()` accepts one or more column names. In pyspark 3.2 the column names given in `subset` must match the schema column names exactly, including case, or the replacements are silently skipped for the non-matching columns. In earlier versions (3.1.2 was tested) the argument is case-insensitive.

      Minimal example:

      replace_dict = {'wrong': 'right'}

      df = spark.createDataFrame(
          [['wrong', 'wrong']],
          schema=['case_matched', 'case_unmatched']
      )
      df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched'])

       

      In pyspark 3.2 (tested 3.2.0, 3.2.1 via pip on windows and 3.2.0 on Databricks) the result is:

      case_matched  case_unmatched
      right         wrong

      While in pyspark 3.1 (tested 3.1.2 via pip on windows and 3.1.2 on Databricks) the result is:

      case_matched  case_unmatched
      right         right

      I believe the expected behaviour is that shown in pyspark 3.1, since in all other situations column names are accepted in a case-insensitive manner.
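
      A possible workaround until this is resolved is to normalize the `subset` names against `df.columns` before calling `replace()`. The sketch below is only illustrative: the helper name `replace_case_insensitive` is not a pyspark API, and it assumes the schema contains no columns that differ only by case:

      def replace_case_insensitive(df, to_replace, subset):
          # Map lower-cased names to the actual schema names so the subset
          # matches regardless of the case the caller used.
          lookup = {c.lower(): c for c in df.columns}
          resolved = [lookup.get(c.lower(), c) for c in subset]
          return df.replace(to_replace, subset=resolved)

      df2 = replace_case_insensitive(df, replace_dict,
                                     subset=['case_matched', 'Case_Unmatched'])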
