Description
This is an umbrella JIRA for adding model export/import to the spark.ml API. It covers the internal Saveable/Loadable API and the Parquet-based format; other formats such as PMML are out of scope here.
This will require the following steps:
- Add export/import for all PipelineStages supported by spark.ml
  - This will include some Transformers which are not Models.
  - These can use almost the same format as the spark.mllib model save/load functions, but the model metadata must store a different class name (marking the class as a spark.ml class).
- After all PipelineStages support save/load, add an interface which forces future additions to support save/load.
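To make the steps above concrete, here is a minimal sketch in plain Python (no Spark dependency). Everything in it is an assumption for illustration: the `Saveable` base class, the `metadata.json` file name, and the `ThresholdTransformer` stage are hypothetical and are not the API proposed in the design doc. It only shows two ideas from the list: the saved metadata records the class name, and an abstract interface forces every new stage to implement save/load. The real format would store model data in Parquet; this sketch uses JSON only.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod


class Saveable(ABC):
    """Hypothetical interface: a stage cannot be instantiated unless it
    implements the two hooks below, so save/load support is enforced."""

    @abstractmethod
    def _params(self) -> dict:
        """Return the JSON-serializable parameters of this stage."""

    @classmethod
    @abstractmethod
    def _from_params(cls, params: dict) -> "Saveable":
        """Rebuild a stage from previously saved parameters."""

    @classmethod
    def _class_name(cls) -> str:
        # The fully qualified name marks which class wrote the metadata.
        return f"{cls.__module__}.{cls.__qualname__}"

    def save(self, path: str) -> None:
        os.makedirs(path)  # raises if the directory already exists
        metadata = {"class": self._class_name(), "params": self._params()}
        with open(os.path.join(path, "metadata.json"), "w") as f:
            json.dump(metadata, f)

    @classmethod
    def load(cls, path: str) -> "Saveable":
        with open(os.path.join(path, "metadata.json")) as f:
            metadata = json.load(f)
        # Refuse to load metadata that was written by a different class.
        if metadata["class"] != cls._class_name():
            raise ValueError(
                f"expected {cls._class_name()}, found {metadata['class']}"
            )
        return cls._from_params(metadata["params"])


class ThresholdTransformer(Saveable):
    """Hypothetical Transformer that is not a Model (it has no fitted state)."""

    def __init__(self, threshold: float):
        self.threshold = threshold

    def _params(self) -> dict:
        return {"threshold": self.threshold}

    @classmethod
    def _from_params(cls, params: dict) -> "ThresholdTransformer":
        return cls(params["threshold"])


# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    stage_path = os.path.join(tmp, "threshold_stage")
    ThresholdTransformer(threshold=0.5).save(stage_path)
    restored = ThresholdTransformer.load(stage_path)
    print(restored.threshold)
```

Checking the stored class name on load is what would keep a spark.ml loader from silently accepting metadata written for a spark.mllib class, even if the rest of the format is nearly identical.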
UPDATE: In spark.ml, we could save feature metadata using DataFrames. Other libraries and formats can support this, and it would be great if we could too. We could do either of the following:
- save() optionally takes a dataset (or schema), and load() returns a (model, schema) pair.
- Models themselves save the input schema.
Both options would mean inheriting from new Saveable, Loadable types.
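The two options can be compared with a small plain-Python sketch. All names here (`save_with_schema`, `load_with_schema`, `SchemaAwareModel`, the `metadata.json` layout, and the schema as a list of name/type dicts) are hypothetical stand-ins, not the proposed API or a real DataFrame schema.

```python
import json
import os
import tempfile
from typing import Optional, Tuple


def _write(path: str, metadata: dict) -> None:
    os.makedirs(path)  # raises if the directory already exists
    with open(os.path.join(path, "metadata.json"), "w") as f:
        json.dump(metadata, f)


def _read(path: str) -> dict:
    with open(os.path.join(path, "metadata.json")) as f:
        return json.load(f)


# Option 1: the caller optionally hands the schema to save(),
# and load() gives back a (model, schema) pair.
def save_with_schema(params: dict, path: str,
                     schema: Optional[list] = None) -> None:
    _write(path, {"params": params, "schema": schema})


def load_with_schema(path: str) -> Tuple[dict, Optional[list]]:
    metadata = _read(path)
    return metadata["params"], metadata["schema"]


# Option 2: the model itself remembers its input schema.
class SchemaAwareModel:
    def __init__(self, params: dict, input_schema: list):
        self.params = params
        self.input_schema = input_schema

    def save(self, path: str) -> None:
        _write(path, {"params": self.params, "schema": self.input_schema})

    @classmethod
    def load(cls, path: str) -> "SchemaAwareModel":
        metadata = _read(path)
        return cls(metadata["params"], metadata["schema"])


# Hypothetical feature metadata: one vector-valued input column.
schema = [{"name": "features", "type": "vector"}]

with tempfile.TemporaryDirectory() as tmp:
    # Option 1 round trip.
    save_with_schema({"threshold": 0.5}, os.path.join(tmp, "opt1"), schema)
    params, loaded_schema = load_with_schema(os.path.join(tmp, "opt1"))

    # Option 2 round trip.
    SchemaAwareModel({"threshold": 0.5}, schema).save(os.path.join(tmp, "opt2"))
    model = SchemaAwareModel.load(os.path.join(tmp, "opt2"))
```

In this sketch the on-disk layout is the same for both options; the difference is only whether the schema travels as a separate argument and return value (option 1) or as a field of the model (option 2).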
UPDATE: DESIGN DOC: Here's a design doc which I wrote. If you have comments about the planned implementation, please comment in this JIRA. Thanks! https://docs.google.com/document/d/1RleM4QiKwdfZZHf0_G6FBNaF7_koc1Ui7qfMT1pf4IA/edit?usp=sharing
Issue Links
- is related to
  - SPARK-11994 Word2VecModel load and save cause SparkException when model is bigger than spark.kryoserializer.buffer.max (Resolved)
  - SPARK-13265 Refactoring of basic ML import/export for other file system besides HDFS (Resolved)
  - SPARK-4587 Model export/import (Resolved)
  - SPARK-5874 How to improve the current ML pipeline API? (Resolved)
  - SPARK-11939 PySpark support model export/import for Pipeline API (Resolved)
- relates to
  - SPARK-14311 Model persistence in SparkR 2.0 (Resolved)