[SPARK-32347] BROADCAST hint makes a weird message that "column can't be resolved" (it was OK in Spark 2.4) - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 3.0.0
Fix Version/s: None
Component/s: SQL
Labels:
None
Environment:

Spark 3.0.0, jupyter notebook, spark launched in local[4] mode, but with Standalone cluster it also fails the same way.

Description

The bug is very easily reproduced: run the following same code in Spark 2.4.3. and in 3.0.0.

The SQL parser will raise an invalid error message with 3.0.0, although everything seems to be OK with the SQL statement and it works fine in Spark 2.4.3

import pandas as pd

pdf_sales = pd.DataFrame([(1, 10), (2, 20)], columns=["BuyerID", "Qty"])
pdf_buyers = pd.DataFrame([(1, "John"), (2, "Jack")], columns=["BuyerID", "BuyerName"])

df_sales = spark.createDataFrame(pdf_sales)
df_buyers = spark.createDataFrame(pdf_buyers)

df_sales.createOrReplaceTempView("df_sales")
df_buyers.createOrReplaceTempView("df_buyers")

spark.sql("""
    with b as (
        select /*+ BROADCAST(df_buyers) */
            BuyerID, BuyerName 
        from df_buyers
    )
    select 
        b.BuyerID,
        b.BuyerName,
        s.Qty
    from df_sales s
        inner join b on s.BuyerID =  b.BuyerID
""").toPandas()

The (wrong) error message:
---------------------------------------------------------------------------
AnalysisException Traceback (most recent call last)
<ipython-input-4-8dfe318a59ee> in <module>
22 from df_sales s
23 inner join b on s.BuyerID = b.BuyerID
---> 24 """).toPandas()

/opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/session.py in sql(self, sqlQuery)
644 [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
645 """
--> 646 return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
647
648 @since(2.0)

/opt/spark-3.0.0-bin-without-hadoop/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in _call_(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
1306
1307 for temp_arg in temp_args:

/opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in deco(*a, **kw)
135 # Hide where the exception came from that shows a non-Pythonic
136 # JVM exception message.
--> 137 raise_from(converted)
138 else:
139 raise

/opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in raise_from(e)

AnalysisException: cannot resolve '`s.BuyerID`' given input columns: [s.BuyerID, b.BuyerID, b.BuyerName, s.Qty]; line 12 pos 24;
'Project ['b.BuyerID, 'b.BuyerName, 's.Qty]
+- 'Join Inner, ('s.BuyerID = 'b.BuyerID)
:- SubqueryAlias s
: +- SubqueryAlias df_sales
: +- LogicalRDD BuyerID#23L, Qty#24L, false
+- SubqueryAlias b
+- Project BuyerID#27L, BuyerName#28
+- SubqueryAlias df_buyers
+- LogicalRDD BuyerID#27L, BuyerName#28, false

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

2020-07-17 17_46_32-Window.png
17/Jul/20 15:12
67 kB
Ihor Bobak
2020-07-17 17_49_27-Window.png
17/Jul/20 15:12
89 kB
Ihor Bobak
2020-07-17 17_52_51-Window.png
17/Jul/20 15:12
102 kB
Ihor Bobak

Issue Links

duplicates

SPARK-32237 Cannot resolve column when put hint in the views of common table expression

Resolved

links to

[Github] Pull Request #29156 (TJX2014)

Activity

People

Assignee:: Unassigned

Reporter:: Ihor Bobak

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 17/Jul/20 15:12

Updated:: 23/Jul/20 00:46

Resolved:: 23/Jul/20 00:46