[SPARK-23945] Column.isin() should accept a single-column DataFrame as input - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Incomplete
Affects Version/s: 2.3.0
Fix Version/s: None
Component/s: SQL
Labels:
- bulk-closed

Description

In SQL you can filter rows based on the result of a subquery:

SELECT *
FROM table1
WHERE name NOT IN (
    SELECT name
    FROM table2
);

In the Spark DataFrame API, the equivalent would probably look like this:

(table1
    .where(
        ~col('name').isin(
            table2.select('name')
        )
    )
)

However, Column.isin() currently accepts only local values (a list or individual literals), not a DataFrame.

I imagine making this enhancement would happen as part of a larger effort to support correlated subqueries in the DataFrame API.

Or perhaps there is no plan to support this style of query in the DataFrame API, and queries like this should instead be written in a different way? How would we write a query like the one I have above in the DataFrame API, without needing to collect values locally for the NOT IN filter?
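One way such a query might be written today, without collecting values to the driver, is a left anti join. The sketch below is an assumption about how the query above could be rewritten, not something proposed in this ticket: `exclude_matching` is a hypothetical helper name, and `left` and `right` are assumed to be Spark DataFrames that both contain the column `key`.

```python
def exclude_matching(left, right, key):
    """Return the rows of `left` whose `key` value has no match in `right`.

    Hypothetical helper: `left` and `right` are assumed to be
    pyspark.sql.DataFrame objects that both contain the column `key`.
    """
    # Only the key column of the right side is needed for the comparison.
    right_keys = right.select(key)
    # A left anti join keeps a left row only when no right row matches it,
    # so the filtering stays distributed (nothing is collected locally).
    return left.join(right_keys, on=key, how="left_anti")
```

For the tables above this would be called as `exclude_matching(table1, table2, 'name')`. Note that an anti join behaves like `NOT EXISTS` rather than `NOT IN` when NULLs are involved: if `table2.name` contains a NULL, the SQL `NOT IN` query returns no rows, while the anti join still returns the unmatched rows.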

Issue Links

is related to: SPARK-18455 General support for correlated subquery processing (Resolved)

People

Assignee: Unassigned

Reporter: Nicholas Chammas

Votes: 0

Watchers: 5

Dates

Created: 09/Apr/18 21:53

Updated: 25/May/21 01:54

Resolved: 25/May/21 01:38