Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-32347

BROADCAST hint makes a weird message that "column can't be resolved" (it was OK in Spark 2.4)

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Duplicate
    • 3.0.0
    • None
    • SQL
    • None
    • Spark 3.0.0, jupyter notebook, spark launched in local[4] mode, but with Standalone cluster it also fails the same way.

       

       

    Description

      The bug is very easily reproduced: run the following same code in Spark 2.4.3. and in 3.0.0.

      The SQL parser will raise an invalid error message with 3.0.0, although everything seems to be OK with the SQL statement and it works fine in Spark 2.4.3

      import pandas as pd
      
      pdf_sales = pd.DataFrame([(1, 10), (2, 20)], columns=["BuyerID", "Qty"])
      pdf_buyers = pd.DataFrame([(1, "John"), (2, "Jack")], columns=["BuyerID", "BuyerName"])
      
      df_sales = spark.createDataFrame(pdf_sales)
      df_buyers = spark.createDataFrame(pdf_buyers)
      
      df_sales.createOrReplaceTempView("df_sales")
      df_buyers.createOrReplaceTempView("df_buyers")
      
      spark.sql("""
          with b as (
              select /*+ BROADCAST(df_buyers) */
                  BuyerID, BuyerName 
              from df_buyers
          )
          select 
              b.BuyerID,
              b.BuyerName,
              s.Qty
          from df_sales s
              inner join b on s.BuyerID =  b.BuyerID
      """).toPandas()
      

      The (wrong) error message:
      ---------------------------------------------------------------------------
      AnalysisException Traceback (most recent call last)
      <ipython-input-4-8dfe318a59ee> in <module>
      22 from df_sales s
      23 inner join b on s.BuyerID = b.BuyerID
      ---> 24 """).toPandas()

      /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/session.py in sql(self, sqlQuery)
      644 [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
      645 """
      --> 646 return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
      647
      648 @since(2.0)

      /opt/spark-3.0.0-bin-without-hadoop/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in _call_(self, *args)
      1303 answer = self.gateway_client.send_command(command)
      1304 return_value = get_return_value(
      -> 1305 answer, self.gateway_client, self.target_id, self.name)
      1306
      1307 for temp_arg in temp_args:

      /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in deco(*a, **kw)
      135 # Hide where the exception came from that shows a non-Pythonic
      136 # JVM exception message.
      --> 137 raise_from(converted)
      138 else:
      139 raise

      /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in raise_from(e)

      AnalysisException: cannot resolve '`s.BuyerID`' given input columns: [s.BuyerID, b.BuyerID, b.BuyerName, s.Qty]; line 12 pos 24;
      'Project ['b.BuyerID, 'b.BuyerName, 's.Qty]
      +- 'Join Inner, ('s.BuyerID = 'b.BuyerID)
      :- SubqueryAlias s
      : +- SubqueryAlias df_sales
      : +- LogicalRDD BuyerID#23L, Qty#24L, false
      +- SubqueryAlias b
      +- Project BuyerID#27L, BuyerName#28
      +- SubqueryAlias df_buyers
      +- LogicalRDD BuyerID#27L, BuyerName#28, false

      Attachments

        1. 2020-07-17 17_46_32-Window.png
          67 kB
          Ihor Bobak
        2. 2020-07-17 17_49_27-Window.png
          89 kB
          Ihor Bobak
        3. 2020-07-17 17_52_51-Window.png
          102 kB
          Ihor Bobak

        Issue Links

          Activity

            People

              Unassigned Unassigned
              ibobak Ihor Bobak
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: