Description
Column names in PySpark are generally case insensitive, and df.drop() is also case insensitive in general.
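For example, with the default spark.sql.caseSensitive=false setting, string-based column lookups resolve regardless of case. A minimal sketch (assuming an existing SparkSession named spark):

df = spark.createDataFrame([('Pirate', 1)], ['KEY', 'field'])
df.select('key')   # resolves the KEY column
df.drop('FIELD')   # drops the field column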
However, when referring to columns through an upstream DataFrame, such as one side of a join, e.g.
vals1 = [('Pirate', 1), ('Monkey', 2), ('Ninja', 3), ('Spaghetti', 4)]
df1 = spark.createDataFrame(vals1, ['KEY', 'field'])
vals2 = [('Rutabaga', 1), ('Pirate', 2), ('Ninja', 3), ('Darth Vader', 4)]
df2 = spark.createDataFrame(vals2, ['KEY', 'CAPS'])
df_joined = df1.join(df2, df1['key'] == df2['key'], "left")
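For reference, since the join condition is a Column expression rather than a list of column names, both KEY columns are kept in the result (easy to confirm with df_joined.columns or df_joined.printSchema()):

df_joined.columns  # ['KEY', 'field', 'KEY', 'CAPS']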
drop becomes case sensitive, e.g.
# from above, df1 consists of columns ['KEY', 'field']
# from above, df2 consists of columns ['KEY', 'CAPS']
df_joined.select(df2['key'])  # will give a result
df_joined.drop('caps')        # will also give a result
However, note the following:
df_joined.drop(df2['key'])   # no-op
df_joined.drop(df2['caps'])  # no-op
df_joined.drop(df2['KEY'])   # will drop column as expected
df_joined.drop(df2['CAPS'])  # will drop column as expected
In summary, df.drop(df2['col']) does not follow the expected case insensitivity for column names, even though select, join, and dropping a column by name are generally case insensitive.
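One possible workaround (a sketch, not an established API; drop_upstream is a hypothetical helper) is to resolve the exact-case name from the upstream DataFrame before dropping:

def drop_upstream(df, upstream, name):
    # Find the column whose name matches case-insensitively, then drop it
    # using the upstream DataFrame's exact spelling, which drop() honours.
    exact = next(c for c in upstream.columns if c.lower() == name.lower())
    return df.drop(upstream[exact])

drop_upstream(df_joined, df2, 'caps')  # drops df2's CAPS column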