Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-23945

Column.isin() should accept a single-column DataFrame as input

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Minor
    • Resolution: Incomplete
    • 2.3.0
    • None
    • SQL

    Description

      In SQL you can filter rows based on the result of a subquery:

      SELECT *
      FROM table1
      WHERE name NOT IN (
          SELECT name
          FROM table2
      );

      In the Spark DataFrame API, the equivalent would probably look like this:

      (table1
          .where(
              ~col('name').isin(
                  table2.select('name')
              )
          )
      )

      However, .isin() currently only accepts a local list of values.

      I imagine making this enhancement would happen as part of a larger effort to support correlated subqueries in the DataFrame API.

      Or perhaps there is no plan to support this style of query in the DataFrame API, and queries like this should instead be written in a different way? How would we write a query like the one I have above in the DataFrame API, without needing to collect values locally for the NOT IN filter?

       

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              nchammas Nicholas Chammas
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: