Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5.0, 4.0.0
    • Fix Version/s: None
    • Component/s: Connect, PySpark

    Description

      SPARK-45220 uncovered a behavior difference between classic Spark and Spark Connect in a self-join scenario.

      For instance, the following query works in classic Spark (without Spark Connect):

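      from pyspark.sql import Row
      from pyspark.sql import functions as sf

      df = spark.createDataFrame([Row(name="Alice", age=2), Row(name="Bob", age=5)])
      df2 = spark.createDataFrame([Row(name="Tom", height=80), Row(name="Bob", height=85)])
      # Sort the outer-join result by the left-hand DataFrame's `name` column.
      joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name))
      joined.show()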

      But in Spark Connect, it throws this exception:

      pyspark.errors.exceptions.connect.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `name` cannot be resolved. Did you mean one of the following? [`name`, `name`, `age`, `height`].;
      'Sort ['name DESC NULLS LAST], true
      +- Join FullOuter, (name#64 = name#78)
         :- LocalRelation [name#64, age#65L]
         +- LocalRelation [name#78, height#79L]
       

       

      On the other hand, this query fails in classic Spark:

      df.join(df, df.name == df.name, "outer").select(df.name).show() 
      pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are ambiguous... 

       

      whereas the same query works with Spark Connect.

      We need to investigate the behavior difference and fix it.

          People

            Assignee: Unassigned
            Reporter: Allison Wang (allisonwang-db)