Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Duplicate
-
3.0.0
-
None
-
None
-
Spark 3.0.0, jupyter notebook, spark launched in local[4] mode, but with Standalone cluster it also fails the same way.
Description
The bug is very easily reproduced: run the following same code in Spark 2.4.3. and in 3.0.0.
The SQL parser will raise an invalid error message with 3.0.0, although everything seems to be OK with the SQL statement and it works fine in Spark 2.4.3
import pandas as pd pdf_sales = pd.DataFrame([(1, 10), (2, 20)], columns=["BuyerID", "Qty"]) pdf_buyers = pd.DataFrame([(1, "John"), (2, "Jack")], columns=["BuyerID", "BuyerName"]) df_sales = spark.createDataFrame(pdf_sales) df_buyers = spark.createDataFrame(pdf_buyers) df_sales.createOrReplaceTempView("df_sales") df_buyers.createOrReplaceTempView("df_buyers") spark.sql(""" with b as ( select /*+ BROADCAST(df_buyers) */ BuyerID, BuyerName from df_buyers ) select b.BuyerID, b.BuyerName, s.Qty from df_sales s inner join b on s.BuyerID = b.BuyerID """).toPandas()
The (wrong) error message:
---------------------------------------------------------------------------
AnalysisException Traceback (most recent call last)
<ipython-input-4-8dfe318a59ee> in <module>
22 from df_sales s
23 inner join b on s.BuyerID = b.BuyerID
---> 24 """).toPandas()
/opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/session.py in sql(self, sqlQuery)
644 [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
645 """
--> 646 return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
647
648 @since(2.0)
/opt/spark-3.0.0-bin-without-hadoop/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in _call_(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
1306
1307 for temp_arg in temp_args:
/opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in deco(*a, **kw)
135 # Hide where the exception came from that shows a non-Pythonic
136 # JVM exception message.
--> 137 raise_from(converted)
138 else:
139 raise
/opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in raise_from(e)
AnalysisException: cannot resolve '`s.BuyerID`' given input columns: [s.BuyerID, b.BuyerID, b.BuyerName, s.Qty]; line 12 pos 24;
'Project ['b.BuyerID, 'b.BuyerName, 's.Qty]
+- 'Join Inner, ('s.BuyerID = 'b.BuyerID)
:- SubqueryAlias s
: +- SubqueryAlias df_sales
: +- LogicalRDD BuyerID#23L, Qty#24L, false
+- SubqueryAlias b
+- Project BuyerID#27L, BuyerName#28
+- SubqueryAlias df_buyers
+- LogicalRDD BuyerID#27L, BuyerName#28, false
Attachments
Attachments
Issue Links
- duplicates
-
SPARK-32237 Cannot resolve column when put hint in the views of common table expression
- Resolved
- links to