Spark / SPARK-27892

Saving/loading stages in PipelineModel should be parallel


Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: ML

    Description

      When a PipelineModel is saved or loaded, all of its stages are processed sequentially. For a PipelineModel with many stages, even though each individual stage saves or loads in under a second, the total time for the whole PipelineModel can reach several minutes. It should be straightforward to parallelize the per-stage save/load loop in the SharedReadWrite object.


      To reproduce:

      import org.apache.spark.ml._
      import org.apache.spark.ml.feature.VectorAssembler
      import spark.implicits._  // needed for toDF in the spark-shell

      val outputPath = "..."
      // Build a pipeline with 100 trivial stages so the sequential
      // per-stage save time becomes noticeable.
      val stages = (1 to 100).map { i =>
        new VectorAssembler().setInputCols(Array("input")).setOutputCol("o" + i)
      }
      val p = new Pipeline().setStages(stages.toArray)
      val data = Seq(1, 1, 1).toDF("input")
      val pm = p.fit(data)
      pm.save(outputPath)
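
      The proposed change could be sketched roughly as follows. This is a hedged illustration, not Spark's actual API: Stage and saveStage below are hypothetical stand-ins for the MLWriter-based per-stage save in SharedReadWrite.

      import scala.concurrent.{Await, Future}
      import scala.concurrent.duration.Duration
      import scala.concurrent.ExecutionContext.Implicits.global

      final case class Stage(uid: String)

      // Placeholder for one stage's (sub-second) save; in Spark this
      // would be something like stage.write.save(stagePath).
      def saveStage(stage: Stage, path: String): Unit =
        println(s"saved ${stage.uid} under $path")

      // Dispatch every stage's save to a Future, then wait for all of
      // them; Future.sequence propagates the first failure, so errors
      // still surface as they would in the sequential loop.
      def saveStagesInParallel(stages: Seq[Stage], path: String): Unit = {
        val futures = stages.map(s => Future(saveStage(s, path)))
        Await.result(Future.sequence(futures), Duration.Inf)
      }

      With this shape, total wall-clock time approaches that of the slowest single stage instead of the sum over all stages.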


          People

            Assignee: Unassigned
            Reporter: Jason Wang (memoryz)
            Votes: 0
            Watchers: 5
