[SPARK-43790] Add API `copyLocalFileToHadoopFS` - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.5.0
Fix Version/s: 3.5.0
Component/s: Connect, ML, PySpark
Labels: None

Description

The new distributed Spark ML module (designed to support Spark Connect and local inference) needs to save ML models to a Hadoop file system in a custom binary file format, for the following reasons:

We often submit a Spark application to a Spark cluster to run a model training job, and the trained model must be saved to the Hadoop file system before the Spark application completes.
We also want to support local model inference. If the model is saved with the current Spark DataFrame writer (e.g. in Parquet format), loading it depends on a running Spark service. We want to be able to load the model without a Spark service, so the model should be saved in its original binary format, which our ML code can handle directly.
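
To illustrate the local-inference requirement, here is a minimal sketch of a model written as a plain binary file and loaded back with no Spark session involved. The use of `pickle`, the dict-based linear model, and the helper names `save_model`, `load_model`, and `predict` are all assumptions made for illustration; the issue does not specify the actual binary format.

```python
import pickle
from pathlib import Path


def save_model(model: dict, path: Path) -> None:
    # Write the model as a raw binary file (pickle is an assumed format).
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path: Path) -> dict:
    # Load the model with plain Python; no Spark service is needed.
    with open(path, "rb") as f:
        return pickle.load(f)


def predict(model: dict, features: list) -> float:
    # Hypothetical linear model: dot(weights, features) + bias.
    total = sum(w * x for w, x in zip(model["weights"], features))
    return total + model["bias"]
```

A file written this way can be read by any Python process that has the file, which is the property the Parquet-based writer does not give.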

So we need to add an API like `copyLocalFileToHadoopFS`.

The implementation of `copyLocalFileToHadoopFS` could be:

(1) Call the `add_artifact` API to upload the local file to the Spark driver (Spark Connect already supports this).

(2) Implement a PySpark (Spark Connect client) API, `copy_artifact_to_hadoop_fs`, which sends a command to the Spark driver requesting that the artifact file be uploaded to the Hadoop FS. This requires designing a Spark Connect protobuf command message. On the driver side, when the Spark Connect server receives the request, it gets `sparkContext.hadoopConf` and then uses the Hadoop FileSystem API to upload the file to the Hadoop FS.

(3) Call the `copy_artifact_to_hadoop_fs` API to upload the artifact file to the Hadoop FS.
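
The two-stage flow in the steps above can be sketched without a cluster. This is only a local simulation under stated assumptions: the `FakeDriver` class is hypothetical, plain directories stand in for the driver's artifact store and for the Hadoop FS, and `shutil.copyfile` stands in for both the Spark Connect artifact upload and the Hadoop FileSystem call. It is not the Spark Connect implementation, and no protobuf message is modeled.

```python
import shutil
from pathlib import Path


class FakeDriver:
    """Hypothetical stand-in for the Spark driver (not a Spark class)."""

    def __init__(self, root: Path) -> None:
        # Directories that play the role of the driver-side artifact
        # store and of the Hadoop file system.
        self.artifact_dir = root / "artifacts"
        self.hadoop_fs_root = root / "hadoop_fs"
        self.artifact_dir.mkdir(parents=True)
        self.hadoop_fs_root.mkdir(parents=True)

    def add_artifact(self, local_path: Path) -> str:
        # Step (1): the client uploads a local file to the driver.
        shutil.copyfile(local_path, self.artifact_dir / local_path.name)
        return local_path.name

    def copy_artifact_to_hadoop_fs(self, artifact_name: str, dest_path: str) -> Path:
        # Step (2): the driver copies the stored artifact to the target
        # file system; a real server would use the Hadoop FileSystem API.
        dest = self.hadoop_fs_root / dest_path.lstrip("/")
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(self.artifact_dir / artifact_name, dest)
        return dest


def copy_local_file_to_hadoop_fs(driver: FakeDriver, local_path: Path, dest_path: str) -> Path:
    # Steps (1) and (3) chained: upload the artifact, then ask the
    # driver to place it on the (simulated) Hadoop FS.
    artifact_name = driver.add_artifact(local_path)
    return driver.copy_artifact_to_hadoop_fs(artifact_name, dest_path)
```

Splitting the work this way keeps the client free of any Hadoop dependency: the client only needs the artifact upload channel, while the driver, which already holds the Hadoop configuration, performs the final write.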