  1. Spark
  2. SPARK-30641 Project Matrix: Linear Models revisit and refactor
  3. SPARK-32061

Potential regression if memoryUsage is used instead of numRows


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Resolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: ML, PySpark
    • Labels: None

    Description

      1, if `memoryUsage` is set improperly, for example too small to store a single instance;

      2, the blockify+GMM implementation reuses two matrices whose shapes depend on the current blockSize:

      @transient private lazy val auxiliaryProbMat = DenseMatrix.zeros(blockSize, k)
      @transient private lazy val auxiliaryPDFMat = DenseMatrix.zeros(blockSize, numFeatures) 

      When implementing blockify+GMM, I found that not pre-allocating those matrices caused a serious regression (maybe 3~4 times slower; I forgot the exact numbers);

      3, in MLP, three pre-allocated objects are also related to numRows:

      if (ones == null || ones.length != delta.cols) ones = BDV.ones[Double](delta.cols)
      
      // TODO: allocate outputs as one big array and then create BDMs from it
      if (outputs == null || outputs(0).cols != currentBatchSize) {
      ...
      
      // TODO: allocate deltas as one big array and then create BDMs from it
      if (deltas == null || deltas(0).cols != currentBatchSize) {
        deltas = new Array[BDM[Double]](layerModels.length)
      ... 

      I am not very familiar with the MLP implementation and could not find any documentation about this pre-allocation. But I guess there may be a regression if we disable this pre-allocation, since those objects look relatively large.
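
The concern in points 2 and 3 can be illustrated with a minimal sketch. It is not Spark code: `ScratchBuffer`, `countAllocations`, and the block sizes are hypothetical names and values chosen for illustration. It follows the MLP-style pattern quoted above, where a buffer is reallocated only when the row count of the current block differs from the previous one, and it assumes that blocking by memoryUsage yields blocks whose row counts vary.

```scala
// Hypothetical sketch: none of these names are Spark APIs.
object PreallocationSketch {

  /** Scratch matrix kept as a flat array and reallocated only when the row count changes. */
  final class ScratchBuffer(numCols: Int) {
    private var data: Array[Double] = null
    private var rows: Int = -1
    var allocations: Int = 0

    def get(numRows: Int): Array[Double] = {
      if (data == null || rows != numRows) {
        // Same kind of size check as the MLP excerpt: reallocate on a shape mismatch.
        data = new Array[Double](numRows * numCols)
        rows = numRows
        allocations += 1
      }
      data
    }
  }

  /** Counts how often the scratch buffer is reallocated over a sequence of block row counts. */
  def countAllocations(blockRowCounts: Seq[Int], numCols: Int): Int = {
    val buffer = new ScratchBuffer(numCols)
    blockRowCounts.foreach { n =>
      val scratch = buffer.get(n)
      java.util.Arrays.fill(scratch, 0.0) // stand-in for the per-block computation
    }
    buffer.allocations
  }

  def main(args: Array[String]): Unit = {
    // Blocking by numRows: every block except possibly the last has the same row count.
    val fixedBlocks = Seq(4, 4, 4, 4, 2)
    // Blocking by memoryUsage (assumed): the row count depends on the size of each row.
    val variableBlocks = Seq(4, 3, 5, 2, 4)
    println(countAllocations(fixedBlocks, numCols = 3))    // prints 2
    println(countAllocations(variableBlocks, numCols = 3)) // prints 5
  }
}
```

With fixed-size blocks the buffer is allocated once and again only for the trailing partial block; with varying row counts every block in this example triggers a new allocation. Whether that translates into a measurable slowdown in GMM or MLP would need benchmarking.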

       

       

          People

            Assignee: Unassigned
            Reporter: Ruifeng Zheng (podongfeng)