Description
In PySpark, the following invalid DataFrame join does not result in an error:
df1.join(df2, how='not-a-valid-join-type')
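For concreteness, a minimal reproduction along these lines (a sketch against a build exhibiting the bug; the SparkSession named `spark` and the toy DataFrames are assumptions, not from this report):

# Hypothetical repro: `how` is silently ignored because `on` is None,
# so the call degenerates to the condition-less two-argument join.
df1 = spark.createDataFrame([(1, "a")], ["id", "x"])
df2 = spark.createDataFrame([(1, "b")], ["id", "y"])
joined = df1.join(df2, how='not-a-valid-join-type')
joined.show()  # succeeds (cartesian product) instead of raising an error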
The signature for `join` is
def join(self, other, on=None, how=None):
and its implementation skips handling of the `how` parameter entirely when `on` is `None`:
if on is not None and not isinstance(on, list):
    on = [on]

if on is None or len(on) == 0:
    # This branch never consults `how`, so invalid values pass silently.
    jdf = self._jdf.join(other._jdf)
elif isinstance(on[0], basestring):
    if how is None:
        jdf = self._jdf.join(other._jdf, self._jseq(on), "inner")
    else:
        assert isinstance(how, basestring), "how should be basestring"
        jdf = self._jdf.join(other._jdf, self._jseq(on), how)
else:
Given that this behavior can mask user errors (as in the example above), I think we should refactor this to process all arguments first and then call the three-argument `self._jdf.join`. This would handle the invalid example above by passing all arguments through to the JVM DataFrame for analysis; a sketch follows.
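A minimal sketch of one possible shape for that refactor (the argument normalization and the trailing `return` are illustrative assumptions, not the final patch; the Column-based `on` branch is elided, as in the snippet above):

def join(self, other, on=None, how=None):
    if on is not None and not isinstance(on, list):
        on = [on]
    # Normalize `on` up front (Column-based `on` handling elided, as above).
    if on is not None and isinstance(on[0], basestring):
        on = self._jseq(on)
    # Dispatch: fall back to the two-argument join only when neither `on`
    # nor `how` was given; otherwise pass `how` through so the JVM side
    # rejects invalid join types during analysis.
    if on is None and how is None:
        jdf = self._jdf.join(other._jdf)
    else:
        if how is None:
            how = "inner"
        assert isinstance(how, basestring), "how should be basestring"
        jdf = self._jdf.join(other._jdf, on, how)
    return DataFrame(jdf, self.sql_ctx)

Note that passing `on=None` together with an explicit `how` down to the JVM is exactly the territory of the linked SPARK-21264, so a real patch (and its regression test) would need to cover that combination as well.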
I'm not planning to work on this myself, so this bugfix (+ regression test!) is up for grabs in case someone else wants to do it.
Issue Links
- is related to: SPARK-21264 Omitting columns with 'how' specified in join in PySpark throws NPE (Resolved)