[SPARK-43474] Add support to create DataFrame Reference in Spark Connect - ASF JIRA

Details

Type: Task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.5.0
Fix Version/s: 3.5.0
Component/s: Connect, Structured Streaming
Labels:
- pull-request-available

Epic Link:
Streaming Spark Connect

Description

Add support in Spark Connect for caching a DataFrame on the server side. Given the cache key, the client can then create a reference to that DataFrame.

This functionality will be used in streaming foreachBatch(). For every batch, the server needs to call a user function that takes a DataFrame as an argument. With the new functionality, the server can simply cache the DataFrame and pass its id back to the client, which can then create the DataFrame reference. The server replaces the reference with the cached DataFrame when transforming.
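The flow described above can be illustrated with a minimal, library-free sketch. It does not use the actual Spark Connect API: `ServerSession`, `CachedDataFrameRef`, `cache_dataframe`, `resolve`, `evict`, and `run_foreach_batch` are all hypothetical names chosen for illustration, a plain Python list stands in for a DataFrame, and evicting the entry after each batch is an assumption rather than something the issue states.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedDataFrameRef:
    """Hypothetical client-side placeholder: holds only the cache key."""
    cache_key: str


class ServerSession:
    """Hypothetical stand-in for per-session state kept on the server."""

    def __init__(self):
        self._cache = {}

    def cache_dataframe(self, df):
        # Store the DataFrame under a fresh key and return the key,
        # which is what would be sent back to the client.
        key = str(uuid.uuid4())
        self._cache[key] = df
        return key

    def resolve(self, ref):
        # Replace a reference with the cached DataFrame it points to.
        if ref.cache_key not in self._cache:
            raise KeyError(f"no cached DataFrame for key {ref.cache_key!r}")
        return self._cache[ref.cache_key]

    def evict(self, key):
        # Assumption: drop the entry once the batch has been handled.
        self._cache.pop(key, None)


def run_foreach_batch(session, batches, user_fn):
    """Call user_fn once per batch, passing a reference instead of the data."""
    results = []
    for batch_id, df in enumerate(batches):
        key = session.cache_dataframe(df)
        try:
            ref = CachedDataFrameRef(key)  # built by the client from the id
            results.append(user_fn(session, ref, batch_id))
        finally:
            session.evict(key)
    return results


def count_rows(session, ref, batch_id):
    # Example user function: the reference is resolved on the server side.
    return batch_id, len(session.resolve(ref))
```

In this sketch, `run_foreach_batch(ServerSession(), [[1, 2, 3], [4]], count_rows)` hands the user function one reference per batch. Only the key crosses the client/server boundary; the data itself stays on the server.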

Issue Links

causes

SPARK-46453 SessionHolder doesn't throw exceptions from internalError()

Resolved

SPARK-45791 Rename `SparkConnectSessionHodlerSuite.scala` to `SparkConnectSessionHolderSuite.scala`

Resolved

links to

[Github] Pull Request #41580 (rangadi)

[Github] Pull Request #41618 (rangadi)

GitHub Pull Request #44400

People

Assignee: Raghu Angadi

Reporter: Peng Zhong

Votes: 0

Watchers: 3

Dates

Created: 12/May/23 00:32

Updated: 19/Dec/23 07:25

Resolved: 29/Jun/23 16:25