[SPARK-41379] Inconsistency of spark session in DataFrame in user function for foreachBatch sink in PySpark - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.3.2, 3.4.0
Fix Version/s: 3.3.2, 3.4.0
Component/s: PySpark, Structured Streaming
Labels: None

Description

https://docs.databricks.com/_static/notebooks/merge-in-streaming.html

According to manual testing against the code example above in PySpark, the sparkSession property of the DataFrame passed to the user function is not the same as the cloned session used by the streaming query. In other words, df.sparkSession does not appear to be the same session as the cloned Spark session, which you can access via df._jdf.sparkSession().

As a result, which session gets picked depends on how each PySpark DataFrame method is implemented, which users have no way of knowing. If a method picks a different session than expected, this opens a backdoor for bypassing restrictions (e.g. AQE), makes session-scoped resources (e.g. temp views) invisible, and so on.

So it is critical to sync the two so that they refer to the same session.
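
The mismatch described above could be checked with something like the following minimal sketch. The helper names (sessions_match, report_sessions, run_demo) are hypothetical and not part of PySpark, and the rate source and local master are assumptions chosen only to keep the demo self-contained. It relies on the internal attributes _jsparkSession and _jdf, which are implementation details and may change between releases.

```python
def sessions_match(df):
    """Return True if both access paths yield the same JVM SparkSession."""
    # Path 1: the Python-side session exposed by the DataFrame,
    # unwrapped to its JVM counterpart (internal attribute).
    python_side = df.sparkSession._jsparkSession
    # Path 2: the session attached to the underlying JVM Dataset.
    jvm_side = df._jdf.sparkSession()
    # Assumption: for py4j proxies, == delegates to Java's equals(),
    # which for SparkSession should amount to reference equality.
    return bool(python_side == jvm_side)


def report_sessions(batch_df, batch_id):
    """Hypothetical foreachBatch callback: report whether the sessions agree."""
    print(f"batch {batch_id}: sessions match = {sessions_match(batch_df)}")


def run_demo(seconds=10):
    """Run a short-lived streaming query that calls report_sessions per batch."""
    # Imported here so the helpers above can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[2]")
        .appName("spark-41379-check")
        .getOrCreate()
    )
    query = (
        spark.readStream.format("rate").load()
        .writeStream.foreachBatch(report_sessions)
        .start()
    )
    query.awaitTermination(seconds)  # let a few micro-batches run
    query.stop()
    spark.stop()
```

Calling run_demo() on an affected version would be expected to report a mismatch for each micro-batch, and a match once the two sessions are synced, though this has not been verified here against specific releases.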

Issue Links

links to

[Github] Pull Request #38906 (HeartSaVioR)

People

Assignee: Jungtaek Lim

Reporter: Jungtaek Lim

Votes: 0

Watchers: 3

Dates

Created: 04/Dec/22 23:36

Updated: 05/Dec/22 06:01

Resolved: 05/Dec/22 06:01