[SPARK-28189] Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.4.0
Fix Version/s: 3.0.0
Component/s: SQL
Labels: None

Description

Column names are generally case insensitive in PySpark, and df.drop() is generally case insensitive as well.

However, this changes when a column is referenced through an upstream table, such as one input of a join, e.g.

# assumes an active SparkSession named `spark`, as in the pyspark shell
vals1 = [('Pirate', 1), ('Monkey', 2), ('Ninja', 3), ('Spaghetti', 4)]
df1 = spark.createDataFrame(vals1, ['KEY', 'field'])

vals2 = [('Rutabaga', 1), ('Pirate', 2), ('Ninja', 3), ('Darth Vader', 4)]
df2 = spark.createDataFrame(vals2, ['KEY', 'CAPS'])

# the join condition uses lowercase 'key' and still resolves
df_joined = df1.join(df2, df1['key'] == df2['key'], "left")

In this case drop becomes case sensitive, e.g.

# from above, df1 consists of columns ['KEY', 'field']
# from above, df2 consists of columns ['KEY', 'CAPS']

df_joined.select(df2['key']) # will give a result
df_joined.drop('caps') # will also give a result

However, note the following:

df_joined.drop(df2['key']) # no-op
df_joined.drop(df2['caps']) # no-op

df_joined.drop(df2['KEY']) # will drop column as expected
df_joined.drop(df2['CAPS']) # will drop column as expected

In summary, df.drop(df2['col']) does not follow the expected case insensitivity for column names, even though select, join, and dropping a column by name are generally case insensitive.
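
Until a release with the fix is available, one possible workaround is to look up the exact-case column name before building the column reference, since the report shows that drop works when the case matches. The helper below is only a sketch: resolve_column_name is a hypothetical name, not part of PySpark, and it works on a plain list of names such as df2.columns.

```python
def resolve_column_name(columns, requested):
    """Return the name in `columns` that equals `requested`, ignoring case.

    Hypothetical helper, not a PySpark API. Raises ValueError when there is
    no match or when the match is ambiguous.
    """
    matches = [c for c in columns if c.lower() == requested.lower()]
    if len(matches) != 1:
        raise ValueError(
            "expected exactly one column matching %r, found %r" % (requested, matches)
        )
    return matches[0]


# Column names of df2 from the report; with a live DataFrame use df2.columns.
df2_columns = ['KEY', 'CAPS']
print(resolve_column_name(df2_columns, 'key'))   # KEY
print(resolve_column_name(df2_columns, 'caps'))  # CAPS

# Assumed usage on an affected version (needs a running SparkSession):
# df_joined = df_joined.drop(df2[resolve_column_name(df2.columns, 'key')])
```

The helper refuses ambiguous matches on purpose, so a table that contains both 'key' and 'KEY' raises an error instead of dropping an arbitrary column.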

Issue Links

links to

GitHub Pull Request #25055

GitHub Pull Request #25216

People

Assignee: Tony Zhang

Reporter: Luke Chu

Votes: 0

Watchers: 4

Dates

Created: 27/Jun/19 20:35

Updated: 21/Jul/19 07:12

Resolved: 07/Jul/19 04:45