Description
In PySpark, the following invalid DataFrame join does not result in an error:
df1.join(df2, how='not-a-valid-join-type')
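For concreteness, a minimal reproduction along these lines (a sketch against a build exhibiting the bug; the SparkSession named `spark` and the toy DataFrames are assumptions, not from this report):

# Hypothetical repro: `how` is silently ignored because `on` is None,
# so the call degenerates to the condition-less two-argument join.
df1 = spark.createDataFrame([(1, "a")], ["id", "x"])
df2 = spark.createDataFrame([(1, "b")], ["id", "y"])
joined = df1.join(df2, how='not-a-valid-join-type')
joined.show()  # succeeds (cartesian product) instead of raising an error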
The signature for `join` is
def join(self, other, on=None, how=None):
and its implementation skips handling of the `how` parameter entirely when `on` is `None`:
if on is not None and not isinstance(on, list):
    on = [on]

if on is None or len(on) == 0:
    # This branch never consults `how`, so invalid values pass silently.
    jdf = self._jdf.join(other._jdf)
elif isinstance(on[0], basestring):
    if how is None:
        jdf = self._jdf.join(other._jdf, self._jseq(on), "inner")
    else:
        assert isinstance(how, basestring), "how should be basestring"
        jdf = self._jdf.join(other._jdf, self._jseq(on), how)
else:
Given that this behavior can mask user errors (as in the example above), I think we should refactor this to process all arguments first and then call the three-argument `self._jdf.join`. This would handle the invalid example above by passing all arguments through to the JVM DataFrame for analysis; a sketch follows.
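A minimal sketch of one possible shape for that refactor (the argument normalization and the trailing `return` are illustrative assumptions, not the final patch; the Column-based `on` branch is elided, as in the snippet above):

def join(self, other, on=None, how=None):
    if on is not None and not isinstance(on, list):
        on = [on]
    # Normalize `on` up front (Column-based `on` handling elided, as above).
    if on is not None and isinstance(on[0], basestring):
        on = self._jseq(on)
    # Dispatch: fall back to the two-argument join only when neither `on`
    # nor `how` was given; otherwise pass `how` through so the JVM side
    # rejects invalid join types during analysis.
    if on is None and how is None:
        jdf = self._jdf.join(other._jdf)
    else:
        if how is None:
            how = "inner"
        assert isinstance(how, basestring), "how should be basestring"
        jdf = self._jdf.join(other._jdf, on, how)
    return DataFrame(jdf, self.sql_ctx)

Note that passing `on=None` together with an explicit `how` down to the JVM is exactly the territory of the linked SPARK-21264, so a real patch (and its regression test) would need to cover that combination as well.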
I'm not planning to work on this myself, so this bugfix (+ regression test!) is up for grabs in case someone else wants to do it.
Issue Links
- is related to: SPARK-21264 Omitting columns with 'how' specified in join in PySpark throws NPE (Resolved)