Details
- Type: Bug
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: 3.2.0, 3.2.1
- Fix Version/s: None
- Component/s: None
Description
The `subset` argument for `DataFrame.replace()` accepts one or more column names. In PySpark 3.2 the case of the column names must match the names in the schema exactly, or the replacements silently do not take place. In earlier versions (3.1.2 was tested) the argument is case-insensitive.
Minimal example:
replace_dict = {'wrong': 'right'}
df = spark.createDataFrame(
    [['wrong', 'wrong']],
    schema=['case_matched', 'case_unmatched']
)
df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched'])
In PySpark 3.2 (tested 3.2.0 and 3.2.1 via pip on Windows, and 3.2.0 on Databricks) the result is:
case_matched | case_unmatched
right        | wrong
While in PySpark 3.1 (tested 3.1.2 via pip on Windows, and 3.1.2 on Databricks) the result is:
case_matched | case_unmatched
right        | right
I believe the expected behaviour is that of PySpark 3.1, since in all other situations column names are accepted case-insensitively.
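Until this is fixed, one possible stopgap is to normalize the `subset` names against the DataFrame's actual schema before calling `replace()`. The sketch below is a hypothetical workaround, not a Spark API: `resolve_subset` is an assumed helper name, and it maps each requested name to the schema's exact spelling by case-insensitive lookup, passing unknown names through unchanged.

```python
def resolve_subset(columns, subset):
    # Hypothetical helper (not part of PySpark): map each name in
    # `subset` to its exact spelling in `columns`, matching
    # case-insensitively; names with no match pass through unchanged.
    lookup = {c.lower(): c for c in columns}
    return [lookup.get(name.lower(), name) for name in subset]

# With a DataFrame `df` as in the example above, one would then call:
#   df.replace(replace_dict, subset=resolve_subset(df.columns, subset))
```

This keeps the 3.1 behaviour on 3.2 without touching the schema itself, at the cost of one extra pass over `df.columns`.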