Spark / SPARK-38109

pyspark DataFrame.replace() is sensitive to column name case in pyspark 3.2 but not in 3.1


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.2.0, 3.2.1
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      The `subset` argument for `DataFrame.replace()` accepts one or more column names. In pyspark 3.2 the column names given in `subset` must match the schema column names exactly, including case, or the replacements are silently skipped for the non-matching columns. In earlier versions (3.1.2 was tested) the argument is case-insensitive.

      Minimal example:

      replace_dict = {'wrong': 'right'}

      df = spark.createDataFrame(
          [['wrong', 'wrong']],
          schema=['case_matched', 'case_unmatched']
      )
      df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched'])

       

      In pyspark 3.2 (tested 3.2.0, 3.2.1 via pip on windows and 3.2.0 on Databricks) the result is:

      case_matched  case_unmatched
      right         wrong

      While in pyspark 3.1 (tested 3.1.2 via pip on windows and 3.1.2 on Databricks) the result is:

      case_matched  case_unmatched
      right         right

      I believe the expected behaviour is that shown in pyspark 3.1, since in all other situations column names are accepted in a case-insensitive manner.
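
      A possible workaround until this is resolved is to normalize the `subset` names against `df.columns` before calling `replace()`. The sketch below is only illustrative: the helper name `replace_case_insensitive` is not a pyspark API, and it assumes the schema contains no columns that differ only by case:

      def replace_case_insensitive(df, to_replace, subset):
          # Map lower-cased names to the actual schema names so the subset
          # matches regardless of the case the caller used.
          lookup = {c.lower(): c for c in df.columns}
          resolved = [lookup.get(c.lower(), c) for c in subset]
          return df.replace(to_replace, subset=resolved)

      df2 = replace_case_insensitive(df, replace_dict,
                                     subset=['case_matched', 'Case_Unmatched'])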
