  Spark / SPARK-12520

Python API dataframe join returns wrong results on outer join

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.1
    • Fix Version/s: 1.5.3, 1.6.0, 2.0.0
    • Component/s: PySpark, SQL
    • Labels: None

    Description

      Consider the following dataframes:

      """
      left_table:
      +------------+------------+------+--------------+
      |head_id_left|tail_id_left|weight|joining_column|
      +------------+------------+------+--------------+
      |           1|           2|     1|           1~2|
      +------------+------------+------+--------------+

      right_table:
      +-------------+-------------+--------------+
      |head_id_right|tail_id_right|joining_column|
      +-------------+-------------+--------------+
      +-------------+-------------+--------------+
      """

      The following code returns an empty dataframe:

      """
      joined_table = left_table.join(right_table, "joining_column", "outer")
      """

      joined_table has zero rows, although a full outer join should keep the one row from left_table.

      However:

      """
      joined_table = left_table.join(right_table, left_table.joining_column == right_table.joining_column, "outer")
      """

      returns the correct answer with one row.
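
To make the expected result concrete, here is a minimal plain-Python sketch of full outer join semantics applied to the two tables above. It is an illustration only, not Spark's implementation: the helper `full_outer_join` and the variables `left_rows`, `right_rows`, and `expected_rows` are hypothetical names introduced here, and rows are modelled as plain dicts rather than DataFrames.

```python
# Illustrative only: models full outer join semantics with lists of dicts.
# full_outer_join is a hypothetical helper, not part of PySpark.

def full_outer_join(left_rows, right_rows, left_cols, right_cols, key):
    """Full outer join on a shared column `key`.

    Rows without a partner on the other side are kept and padded with None.
    """
    left_other = [c for c in left_cols if c != key]
    right_other = [c for c in right_cols if c != key]
    joined = []
    matched_right = set()

    for left in left_rows:
        # As in SQL, a null (None) key never matches anything.
        hits = [
            i for i, right in enumerate(right_rows)
            if left[key] is not None and right[key] == left[key]
        ]
        if hits:
            for i in hits:
                matched_right.add(i)
                row = {key: left[key]}
                row.update({c: left[c] for c in left_other})
                row.update({c: right_rows[i][c] for c in right_other})
                joined.append(row)
        else:
            # Unmatched left row: keep it, pad the right-hand columns.
            row = {key: left[key]}
            row.update({c: left[c] for c in left_other})
            row.update({c: None for c in right_other})
            joined.append(row)

    for i, right in enumerate(right_rows):
        if i not in matched_right:
            # Unmatched right row: keep it, pad the left-hand columns.
            row = {key: right[key]}
            row.update({c: None for c in left_other})
            row.update({c: right[c] for c in right_other})
            joined.append(row)

    return joined


left_cols = ["head_id_left", "tail_id_left", "weight", "joining_column"]
right_cols = ["head_id_right", "tail_id_right", "joining_column"]

left_rows = [
    {"head_id_left": 1, "tail_id_left": 2, "weight": 1, "joining_column": "1~2"}
]
right_rows = []  # right_table is empty in the report

expected_rows = full_outer_join(
    left_rows, right_rows, left_cols, right_cols, "joining_column"
)
print(len(expected_rows))
```

Under these semantics the single left row survives with `head_id_right` and `tail_id_right` set to None, which matches the one-row result of the explicit-condition join and is what the column-name form would also be expected to return.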

          People

            Assignee: Xiao Li (smilegator)
            Reporter: Aravind B (akshan)