[SPARK-14761] PySpark DataFrame.join should reject invalid join methods even when join columns are not specified


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.0
    • Component/s: PySpark, SQL

    Description

      In PySpark, the following invalid DataFrame join does not raise an error:

      df1.join(df2, how='not-a-valid-join-type')
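
      To see the failure mode end to end, here is a minimal reproduction sketch (the spark session setup is assumed):

          df1 = spark.createDataFrame([(1, "a")], ["id", "v"])
          df2 = spark.createDataFrame([(1, "b")], ["id", "w"])
          # Expected: an error for the bogus join type.
          # Actual: the call succeeds as if no join type had been passed.
          df1.join(df2, how='not-a-valid-join-type').show()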
      

      The signature for `join` is

          def join(self, other, on=None, how=None):
      

      and its implementation ends up skipping the `how` parameter entirely when `on` is `None`:

          if on is not None and not isinstance(on, list):
              on = [on]

          if on is None or len(on) == 0:
              # `how` is never consulted on this path, so an invalid
              # join type is silently ignored whenever `on` is omitted.
              jdf = self._jdf.join(other._jdf)
          elif isinstance(on[0], basestring):
              if how is None:
                  jdf = self._jdf.join(other._jdf, self._jseq(on), "inner")
              else:
                  assert isinstance(how, basestring), "how should be basestring"
                  jdf = self._jdf.join(other._jdf, self._jseq(on), how)
          else:
              # (remaining branch elided in the original snippet)

      Given that this behavior can mask user errors (as in the example above), I think we should refactor this to first process all of the arguments and then call the three-argument self._jdf.join in every case. That would handle the invalid example above by passing all arguments through to the JVM DataFrame for analysis; a rough sketch follows.
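
      One possible shape for that refactoring, sketched against the helpers visible in the snippet above (self._jdf, self._jseq, basestring). It covers only the string-key path, since the remaining branch was elided, and the empty-key-sequence fallback for on=None is an assumption rather than a verified behavior of the JVM API:

          def join(self, other, on=None, how=None):
              # Normalize both arguments up front, then always delegate to
              # the three-argument JVM join so that an invalid `how` reaches
              # the JVM analyzer and is rejected there.
              if on is not None and not isinstance(on, list):
                  on = [on]
              if on is None or len(on) == 0:
                  # Assumption: an empty key sequence behaves like the
                  # two-argument join previously used on this path.
                  on = self._jseq([])
              elif isinstance(on[0], basestring):
                  on = self._jseq(on)
              if how is None:
                  how = "inner"
              assert isinstance(how, basestring), "how should be basestring"
              return DataFrame(self._jdf.join(other._jdf, on, how), self.sql_ctx)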

      I'm not planning to work on this myself, so this bugfix (+ regression test!) is up for grabs in case someone else wants to do it.
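
      For whoever picks this up, a minimal regression test might look like the sketch below. It assumes a unittest-style test case with a self.spark session fixture, and the exact exception type to assert will depend on where the validation ends up living:

          def test_invalid_join_method(self):
              df1 = self.spark.createDataFrame([(1, "a")], ["id", "v1"])
              df2 = self.spark.createDataFrame([(1, "b")], ["id", "v2"])
              # An invalid join type should raise instead of being ignored.
              self.assertRaises(Exception,
                                lambda: df1.join(df2, how="invalid-join-type"))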

People

    Assignee: bijay697 Bijay Kumar Pathak
    Reporter: joshrosen Josh Rosen
