SPARK-14761

PySpark DataFrame.join should reject invalid join methods even when join columns are not specified


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.0
    • Component/s: PySpark, SQL

    Description

      In PySpark, the following invalid DataFrame join does not raise an error:

      df1.join(df2, how='not-a-valid-join-type')
      

      The signature for `join` is

          def join(self, other, on=None, how=None):
      

      and its code ends up completely skipping handling of the `how` parameter when `on` is `None`:

              if on is not None and not isinstance(on, list):
                  on = [on]
      
              if on is None or len(on) == 0:
                  jdf = self._jdf.join(other._jdf)
              elif isinstance(on[0], basestring):
                  if how is None:
                      jdf = self._jdf.join(other._jdf, self._jseq(on), "inner")
                  else:
                      assert isinstance(how, basestring), "how should be basestring"
                      jdf = self._jdf.join(other._jdf, self._jseq(on), how)
              else:
      

      Given that this behavior can mask user errors (as in the above example), I think that we should refactor this to first process all arguments and then call the three-argument _jdf.join. This would handle the above invalid example by passing all arguments to the JVM DataFrame for analysis.
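A minimal sketch of the proposed shape, runnable without Spark: resolve `on` and `how` independently, then always take the three-argument path so that an invalid `how` reaches the validating side. `FakeJvmDataFrame`, `VALID_JOIN_TYPES`, `normalize_join_args`, and the free function `join` are hypothetical stand-ins for illustration only. They are not the real py4j object, not Spark's authoritative list of join types, and not necessarily how the actual fix was written.

```python
# Illustrative subset of join types; an assumption, not Spark's real list.
VALID_JOIN_TYPES = {"inner", "outer", "left_outer", "right_outer", "leftsemi"}


class FakeJvmDataFrame:
    """Hypothetical stand-in for the JVM DataFrame behind `_jdf`."""

    def join(self, other, on, how):
        # The JVM side is assumed to reject unknown join types.
        if how not in VALID_JOIN_TYPES:
            raise ValueError("Unsupported join type: %r" % how)
        return ("join", tuple(on), how)


def normalize_join_args(on=None, how=None):
    """Resolve `on` and `how` independently of each other."""
    if on is None:
        on = []
    elif not isinstance(on, list):
        on = [on]
    if how is None:
        how = "inner"
    if not isinstance(how, str):
        raise TypeError("how should be a string")
    return on, how


def join(jdf, other, on=None, how=None):
    on, how = normalize_join_args(on, how)
    # Always use the three-argument call so `how` is never skipped.
    return jdf.join(other, on, how)
```

With this structure, calling `join(jdf, other, how='not-a-valid-join-type')` raises an error even though `on` is omitted, because `how` is forwarded on every code path.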

      I'm not planning to work on this myself, so this bugfix (+ regression test!) is up for grabs in case someone else wants to do it.


          People

            Assignee: Bijay Kumar Pathak (bijay697)
            Reporter: Josh Rosen (joshrosen)
            Votes: 0
            Watchers: 4
