[SPARK-26979] [PySpark] Some SQL functions do not take column names - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.0
Fix Version/s: 3.0.0
Component/s: PySpark
Labels:
- release-notes

Docs Text:

Hide
In Spark 3.0, certain Pyspark SQL functions that previously only accepted a Column object to specify a column can now take a string specifying the column by name, for consistency with the rest of the API:

lower()
upper()
abs()
bitwiseNOT()
ltrim()
rtrim()
trim()
ascii()
base64()
unbase64()

Show
In Spark 3.0, certain Pyspark SQL functions that previously only accepted a Column object to specify a column can now take a string specifying the column by name, for consistency with the rest of the API: lower() upper() abs() bitwiseNOT() ltrim() rtrim() trim() ascii() base64() unbase64()

Description

Most SQL functions defined in org.apache.spark.sql.functions have two variations, one taking a Column object as input, and another taking a string representing a column name, which is then converted into a Column object internally.

There are, however, a few notable exceptions:

lower()
upper()
abs()
bitwiseNOT()

While this doesn't break anything, as you can easily create a Column object yourself prior to passing it to one of these functions, it has two undesirable consequences:

It is surprising - it breaks coder's expectations when they are first starting with Spark. Every API should be as consistent as possible, so as to make the learning curve smoother and to reduce causes for human error;
It gets in the way of stylistic conventions. Most of the time it makes Python/Scala/Java code more readable to use literal names, and the API provides ample support for that, but these few exceptions prevent this pattern from being universally applicable.

This is a very easy fix, and I see no reason not to apply it. I have a PR ready.

UPDATE: Turns out there are many exceptions over this pattern that I wasn't aware of. The reason I missed them is because I had been looking at things from PySpark's point of view, and the API there does support column name literals for almost all SQL functions.

Exceptions for the PySpark API include all the above plus:

ltrim()
rtrim()
trim()
ascii()
base64()
unbase64()

The argument for making the API consistent still stands, however. I have been working on a PR to fix this on PySpark's side, and it should still be a painless change.

Attachments

Issue Links

links to

GitHub Pull Request #23879

GitHub Pull Request #23882

GitHub Pull Request #24121

Activity

People

Assignee:: Andre Sa de Mello

Reporter:: Andre Sa de Mello

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 23/Feb/19 15:15

Updated:: 19/Mar/19 23:22

Resolved:: 17/Mar/19 17:58