Description
In the new distributed Spark ML module (designed to support Spark Connect and local inference), we need to save ML models to the Hadoop file system using a custom binary file format, for the following reasons:
- We often submit a Spark application to a Spark cluster to run the model training job, and we need to save the trained model to the Hadoop file system before the Spark application completes.
- We also want to support local model inference. If we save the model with the current Spark DataFrame writer (e.g. Parquet format), loading it requires a Spark service, but we want to be able to load the model without one. So the model should be saved in the original binary format that our ML code can handle directly.
So we need to add an API like `copyLocalFileToHadoopFS`.
The implementation of `copyLocalFileToHadoopFS` could be (see the sketches after the steps below):
(1) Call the `add_artifact` API to upload the local file to the Spark driver (Spark Connect already supports this).
(2) Implement a PySpark (Spark Connect client) API `copy_artifact_to_hadoop_fs` that sends a command to the Spark driver requesting that the artifact file be uploaded to the Hadoop FS; we need to design a Spark Connect protobuf command message for this part. On the driver side, when the Spark Connect server receives the request, it gets `sparkContext.hadoopConf` and uses the Hadoop FileSystem API to upload the file to the Hadoop FS.
(3) Call the `copy_artifact_to_hadoop_fs` API to upload the artifact file to the Hadoop FS.
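
A minimal client-side sketch of how `copyLocalFileToHadoopFS` could compose steps (1) and (3). The method names and signatures here are assumptions: `add_artifact` stands in for the existing Spark Connect artifact upload API, and `copy_artifact_to_hadoop_fs` is the new API proposed above, which does not exist yet.

```python
import os

from pyspark.sql import SparkSession


def copyLocalFileToHadoopFS(spark: SparkSession, local_path: str, dest_path: str) -> None:
    """Sketch: copy a locally written binary model file to Hadoop FS via Spark Connect.

    Step (1): upload the local file to the Spark driver as an artifact.
    Step (3): ask the driver to copy the staged artifact to the Hadoop FS.
    """
    # Step (1): hypothetical name standing in for the existing Spark Connect
    # artifact upload API.
    spark.add_artifact(local_path)
    artifact_name = os.path.basename(local_path)
    # Step (3): proposed API; the driver copies the staged artifact to the
    # destination Hadoop FS path.
    spark.copy_artifact_to_hadoop_fs(artifact_name, dest_path)
```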
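
On the driver side, step (2) essentially boils down to a Hadoop FileSystem `copyFromLocalFile` from the staged artifact location to the destination path, resolved against the driver's Hadoop configuration. The actual Spark Connect server handler would live in the JVM; the snippet below only illustrates the equivalent Hadoop FileSystem calls from PySpark through the driver's JVM gateway, and the function name is hypothetical.

```python
from pyspark.sql import SparkSession


def copy_to_hadoop_fs(spark: SparkSession, local_path: str, dest_path: str) -> None:
    """Illustration of the Hadoop FileSystem copy the server side would perform.

    Uses the driver's Hadoop configuration so the destination file system
    (HDFS, S3A, ...) is resolved from the cluster config.
    """
    sc = spark.sparkContext
    jvm = sc._jvm
    hadoop_conf = sc._jsc.hadoopConfiguration()
    src = jvm.org.apache.hadoop.fs.Path(local_path)
    dst = jvm.org.apache.hadoop.fs.Path(dest_path)
    # Resolve the target FileSystem from the destination URI and the driver's Hadoop conf.
    fs = dst.getFileSystem(hadoop_conf)
    # delSrc=False, overwrite=True
    fs.copyFromLocalFile(False, True, src, dst)
```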