SPARK-22925

ML model persistence creates a lot of small files


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.1.2, 2.2.1, 2.3.0
    • Fix Version/s: None
    • Component/s: MLlib

    Description

      Today, when calling model.save(), some ML models call makeRDD(data, 1) or repartition(1), while others do not, e.g.:
      https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/impl/GLMRegressionModel.scala#L60
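      To illustrate, a minimal sketch of this single-partition pattern, assuming a simplified Data case class as a stand-in for the model's real on-disk schema (this paraphrases, not reproduces, the linked source):

          import org.apache.spark.sql.SparkSession

          // Hypothetical stand-in for the model's serialized state; the real
          // GLMRegressionModel stores weights and an intercept.
          case class Data(weights: Seq[Double], intercept: Double)

          def save(spark: SparkSession, path: String, data: Data): Unit = {
            // Forcing a single partition yields exactly one Parquet part-file,
            // which can grow into a very large single file (see SPARK-19294).
            spark.createDataFrame(Seq(data))
              .repartition(1)
              .write.parquet(path)
          }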

      In the former case, issues such as SPARK-19294 have been reported, because the save produces a single, very large file.

      Whereas in the latter case, models such as RandomForestModel can create hundreds or thousands of very small files, which is also unmanageable (see the sketch after the links below). Looking into this, there is no simple way to set or change spark.default.parallelism (which would be picked up by sc.parallelize) while the app is running, since SparkConf appears to be copied/cached by the backend with no way to update it:
      https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala#L443
      https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L155
      https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala#L135
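      A minimal sketch of this default-parallelism pattern, again with a hypothetical NodeData case class standing in for the real schema:

          import org.apache.spark.sql.SparkSession

          // Hypothetical stand-in for a tree ensemble's per-node records.
          case class NodeData(treeId: Int, nodeId: Int, prediction: Double)

          def save(spark: SparkSession, path: String, nodes: Seq[NodeData]): Unit = {
            // parallelize has signature (seq, numSlices = defaultParallelism),
            // so with no explicit numSlices it falls back to
            // spark.default.parallelism: on a large cluster this writes one
            // tiny part-file per partition.
            val rdd = spark.sparkContext.parallelize(nodes)
            spark.createDataFrame(rdd).write.parquet(path)
          }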

      It seems we need a way to make numSlices (the second argument of sc.parallelize) settable on a per-use basis.
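      As a hypothetical sketch only (no such API exists in Spark today), a per-use knob could look like this:

          import org.apache.spark.sql.SparkSession

          // Hypothetical schema, as in the previous sketch.
          case class NodeData(treeId: Int, nodeId: Int, prediction: Double)

          // saveWith is an illustrative name, not a proposed signature: the
          // point is that numSlices is threaded through to parallelize, so the
          // caller can trade file count against file size at save time.
          def saveWith(spark: SparkSession, path: String,
                       nodes: Seq[NodeData], numSlices: Int): Unit = {
            val rdd = spark.sparkContext.parallelize(nodes, numSlices)
            spark.createDataFrame(rdd).write.parquet(path)
          }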


          People

            Assignee: Unassigned
            Reporter: Felix Cheung
            Votes: 0
            Watchers: 3
