Description
(see attached SPIP pdf for more details)
At the crossroads of big data and AI, we see both the success of Apache Spark as a unified analytics engine and the rise of AI frameworks like TensorFlow and Apache MXNet (incubating). Both big data and AI are indispensable to driving business innovation, and there have been multiple attempts from both communities to bring them together.
We have seen efforts from the AI community to implement data solutions for AI frameworks, such as tf.data and tf.Transform. However, with 50+ data sources and built-in SQL, DataFrames, and Streaming features, Spark remains the community's choice for big data. This is why there have been many efforts to integrate DL/AI frameworks with Spark and leverage its power, for example the TFRecords data source for Spark, TensorFlowOnSpark, and TensorFrames. As part of Project Hydrogen, this SPIP takes a different angle on Spark + AI unification.
None of these integrations is possible without exchanging data between Spark and the external DL/AI framework, and performance matters. However, there is no standard way to exchange data, so implementations and performance optimizations remain fragmented. For example, TensorFlowOnSpark uses Hadoop InputFormat/OutputFormat for TensorFlow's TFRecords to load and save data, passing RDD records to TensorFlow in Python, while TensorFrames converts Spark DataFrame Rows to/from TensorFlow Tensors using TensorFlow's Java API. How can we reduce this complexity?
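To make the fragmentation concrete, here is a minimal sketch of one common hand-off today, using only Spark's Arrow-backed toPandas() and TensorFlow's tf.data. The data and column names are toy stand-ins; the collect-to-driver step is exactly the kind of cost each integration currently works around in its own way.

{code:python}
from pyspark.sql import SparkSession
import tensorflow as tf

spark = SparkSession.builder.getOrCreate()

# Use Arrow to speed up the Spark-to-pandas conversion
# (config name as of Spark 2.3/2.4).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Toy stand-in for a real feature table in a production warehouse.
df = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (3.0, 4.0, 1.0)], ["x1", "x2", "label"])

# toPandas() collects everything to the driver: simple, but a
# single-node bottleneck, and each Spark + DL integration today
# implements its own variant of this conversion.
pdf = df.toPandas()
ds = tf.data.Dataset.from_tensor_slices(
    (pdf[["x1", "x2"]].values, pdf["label"].values))
{code}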
The proposal here is to standardize the data exchange interface (or format) between Spark and DL/AI frameworks and to optimize data conversion to and from this interface. DL/AI frameworks could then leverage Spark to load data from virtually anywhere without extra effort spent building complex data solutions, for scenarios such as reading features from a production data warehouse or streaming model inference. Spark users could adopt DL/AI frameworks without learning the data APIs specific to each one. And developers on both sides could work on performance optimizations independently, provided the interface itself introduces little overhead.
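As one illustration of what a standardized, Arrow-backed exchange could enable, below is a sketch of distributed model inference with an iterator Pandas UDF, the direction explored in SPARK-26412 (see Issue Links) and available in Spark 3.0. The input data, column name, and model loader are hypothetical stand-ins for a real DL model and scoring table.

{code:python}
from typing import Iterator

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

# Hypothetical input with a numeric "feature" column to score.
df = spark.createDataFrame([(float(i),) for i in range(8)], ["feature"])

def load_model():
    # Stand-in for a real DL model; a real job would deserialize
    # a trained TensorFlow/MXNet model here.
    class Model:
        def predict(self, x):
            return x * 2.0
    return Model()

@pandas_udf("double")
def predict(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Load the model once per task, then score Arrow-backed batches
    # as they stream through, amortizing the expensive setup instead
    # of paying per-row conversion and initialization costs.
    model = load_model()
    for batch in batches:
        yield pd.Series(model.predict(batch.to_numpy()))

df.withColumn("prediction", predict("feature")).show()
{code}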
Attachments
Issue Links
- is related to
  - SPARK-27396 SPIP: Public APIs for extended Columnar Processing Support (Resolved)
- relates to
  - SPARK-26412 Allow Pandas UDF to take an iterator of pd.DataFrames (Resolved)
  - SPARK-26413 SPIP: RDD Arrow Support in Spark Core and PySpark (Open)