SPARK-22925

ML model persistence creates a lot of small files


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.1.2, 2.2.1, 2.3.0
    • Fix Version/s: None
    • Component/s: MLlib

    Description

      Today, when calling model.save(), some ML models call makeRDD(data, 1) or repartition(1), while others do not, e.g.:
      https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/impl/GLMRegressionModel.scala#L60
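      To illustrate, a minimal sketch of this single-partition pattern, assuming a simplified Data case class as a stand-in for the model's real on-disk schema (this paraphrases, not reproduces, the linked source):

          import org.apache.spark.sql.SparkSession

          // Hypothetical stand-in for the model's serialized state; the real
          // GLMRegressionModel stores weights and an intercept.
          case class Data(weights: Seq[Double], intercept: Double)

          def save(spark: SparkSession, path: String, data: Data): Unit = {
            // Forcing a single partition yields exactly one Parquet part-file,
            // which can grow into a very large single file (see SPARK-19294).
            spark.createDataFrame(Seq(data))
              .repartition(1)
              .write.parquet(path)
          }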

      In the former case, issues such as SPARK-19294 have been reported, because the save produces a single, very large file.

      Whereas in the latter case, models such as RandomForestModel can create hundreds or thousands of very small files, which is also unmanageable (see the sketch after the links below). Looking into this, there is no simple way to set or change spark.default.parallelism (which would be picked up by sc.parallelize) while the app is running, since SparkConf appears to be copied/cached by the backend with no way to update it:
      https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala#L443
      https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L155
      https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala#L135
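      A minimal sketch of this default-parallelism pattern, again with a hypothetical NodeData case class standing in for the real schema:

          import org.apache.spark.sql.SparkSession

          // Hypothetical stand-in for a tree ensemble's per-node records.
          case class NodeData(treeId: Int, nodeId: Int, prediction: Double)

          def save(spark: SparkSession, path: String, nodes: Seq[NodeData]): Unit = {
            // parallelize has signature (seq, numSlices = defaultParallelism),
            // so with no explicit numSlices it falls back to
            // spark.default.parallelism: on a large cluster this writes one
            // tiny part-file per partition.
            val rdd = spark.sparkContext.parallelize(nodes)
            spark.createDataFrame(rdd).write.parquet(path)
          }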

      It seems we need a way to make numSlices (the second argument of sc.parallelize) settable on a per-use basis.
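      As a hypothetical sketch only (no such API exists in Spark today), a per-use knob could look like this:

          import org.apache.spark.sql.SparkSession

          // Hypothetical schema, as in the previous sketch.
          case class NodeData(treeId: Int, nodeId: Int, prediction: Double)

          // saveWith is an illustrative name, not a proposed signature: the
          // point is that numSlices is threaded through to parallelize, so the
          // caller can trade file count against file size at save time.
          def saveWith(spark: SparkSession, path: String,
                       nodes: Seq[NodeData], numSlices: Int): Unit = {
            val rdd = spark.sparkContext.parallelize(nodes, numSlices)
            spark.createDataFrame(rdd).write.parquet(path)
          }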


          People

            Assignee: Unassigned
            Reporter: Felix Cheung
            Votes: 0
            Watchers: 3
