SPARK-21680: ML/MLLIB Vector compressed optimization

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.0
    • Component/s: ML, MLlib
    • Labels:
      None

      Description

      When Vector.compressed is used to convert a Vector to a SparseVector, it is much slower than calling Vector.toSparse directly.
      This is because Vector.compressed scans the values three times: once in its own call to numNonzeros, and twice more inside toSparse, which counts the nonzeros again before copying them. Calling Vector.toSparse directly needs only two scans.
      For long vectors, the performance difference between the two methods is significant.
      Current code of Vector.compressed:

        def compressed: Vector = {
          val nnz = numNonzeros
          // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs 12 * nnz + 20 bytes.
          if (1.5 * (nnz + 1.0) < size) {
            toSparse
          } else {
            toDense
          }
        }
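
      The comment in the code above gives the byte estimates behind the size check. As a minimal sketch (the object and method names here are hypothetical, not part of Spark), the check 1.5 * (nnz + 1.0) < size can be seen to be the same condition as 12 * nnz + 20 < 8 * size + 8: subtract 8 from both sides and divide by 8.

```scala
// Hypothetical helper, for illustration only; not part of Spark's API.
object StorageThreshold {
  // Byte estimates taken from the comment in compressed.
  def denseBytes(size: Int): Long = 8L * size + 8
  def sparseBytes(nnz: Int): Long = 12L * nnz + 20

  // The test used by compressed.
  def prefersSparse(size: Int, nnz: Int): Boolean =
    1.5 * (nnz + 1.0) < size
}
```

      For example, with size = 100 and nnz = 10 the sparse estimate is 140 bytes against 808 bytes dense, so the sparse branch is taken.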
      

      I propose to change it to:

        // Reuse the nonzero count computed for the size check, so the sparse
        // branch needs only one more pass over the values.
        def compressed: Vector = {
          val nnz = numNonzeros
          // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs 12 * nnz + 20 bytes.
          if (1.5 * (nnz + 1.0) < size) {
            val ii = new Array[Int](nnz)
            val vv = new Array[Double](nnz)
            var k = 0
            foreachActive { (i, v) =>
              if (v != 0) {
                ii(k) = i
                vv(k) = v
                k += 1
              }
            }
            new SparseVector(size, ii, vv)
          } else {
            toDense
          }
        }
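
      The difference in scan counts can be illustrated with a small standalone model. This is only a sketch: CountingVector, compressedOld and compressedNew are hypothetical names, the class is a plain-array stand-in rather than Spark's DenseVector, and only the sparse branch is modeled (the size check is left out).

```scala
// Hypothetical stand-in for a dense vector; not Spark's DenseVector.
final class CountingVector(values: Array[Double]) {
  // Number of full passes made over `values` so far.
  var scans = 0

  def size: Int = values.length

  // One pass: count the nonzero entries.
  def numNonzeros: Int = {
    scans += 1
    values.count(_ != 0.0)
  }

  // One pass: copy the nonzero entries into arrays of known length nnz.
  private def copyNonzeros(nnz: Int): (Array[Int], Array[Double]) = {
    scans += 1
    val ii = new Array[Int](nnz)
    val vv = new Array[Double](nnz)
    var k = 0
    var i = 0
    while (i < values.length) {
      if (values(i) != 0.0) {
        ii(k) = i
        vv(k) = values(i)
        k += 1
      }
      i += 1
    }
    (ii, vv)
  }

  // Models toSparse: count, then copy (two passes).
  def toSparse: (Array[Int], Array[Double]) = copyNonzeros(numNonzeros)

  // Models the current sparse branch: count for the size check,
  // then toSparse counts again and copies (three passes).
  def compressedOld: (Array[Int], Array[Double]) = {
    val nnz = numNonzeros // used only for the size check in the real code
    toSparse
  }

  // Models the proposal: the count is reused for the copy (two passes).
  def compressedNew: (Array[Int], Array[Double]) = copyNonzeros(numNonzeros)
}
```

      Both variants return the same indices and values. In this model only the scans counter differs: three for compressedOld, two for compressedNew.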
      

            People

            • Assignee: Peng Meng (peng.meng@intel.com)
            • Reporter: Peng Meng (peng.meng@intel.com)
            • Votes: 0
            • Watchers: 3
