Mahout / MAHOUT-1833

Enhance svec function to accept cardinality as parameter


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: 0.12.0
    • Fix Version/s: 0.12.1
    • Component/s: classic
    • Labels: None
    • Environment: Mahout Spark Shell 0.12.0,
      Spark 1.6.0 Cluster on Hadoop Yarn 2.7.1,
      CentOS 7 64-bit

    Description

      It would be nice to enhance the existing svec function in org.apache.mahout.math.scalabindings so that it accepts an optional cardinality parameter:

        /**
         * Create a sparse vector out of a list of (index, value) tuple2's.
         * @param sdata (index, value) pairs for the non-zero elements
         * @param cardinality desired cardinality; if negative, it is inferred from the largest index
         * @return a RandomAccessSparseVector of the resolved cardinality
         */
        def svec(sdata: TraversableOnce[(Int, AnyVal)], cardinality: Int = -1) = {
          // Materialize the input once: a TraversableOnce may only be traversed a single time.
          val data = sdata.toSeq
          val required = if (data.nonEmpty) data.map(_._1).max + 1 else 0
          val resolved =
            if (cardinality < 0) required
            else if (cardinality < required)
              throw new IllegalArgumentException(s"Required cardinality $required but got $cardinality")
            else cardinality
          val initialCapacity = data.size
          val sv = new RandomAccessSparseVector(resolved, initialCapacity)
          data.foreach(t => sv.setQuick(t._1, t._2.asInstanceOf[Number].doubleValue()))
          sv
        }
      

      This way the user can specify the cardinality of the created sparse vector.
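
      For illustration, a quick sketch of how the enhanced function would behave; the indices and values below are made up for demonstration, and the scalabindings are assumed to be in scope (e.g. import org.apache.mahout.math.scalabindings._):

        // Cardinality inferred from the largest index (5 + 1 = 6).
        val v1 = svec(Seq((1, 3.0), (5, 7.0)))

        // Explicit cardinality of 20; indices 1 and 5 are set, the rest stay zero.
        val v2 = svec(Seq((1, 3.0), (5, 7.0)), cardinality = 20)

        // Requesting a cardinality smaller than the required 6 throws IllegalArgumentException.
        // val bad = svec(Seq((1, 3.0), (5, 7.0)), cardinality = 4)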

      This is very useful and convenient when the user wants to create a DRM from many sparse vectors that do not all have the same actual size but share the same logical size, e.g. the rows of a sparse matrix.

      The code below demonstrates the case:

      val cardinality = 20

      // Build (row key, sparse vector) pairs; every vector gets the same cardinality.
      val rdd = sc.textFile("/some/file.txt")
        .map(_.split(","))
        .map(line => (line(0).toInt, Array((line(1).toInt, 1))))
        .reduceByKey((v1, v2) => v1 ++ v2)
        .map(row => (row._1, svec(row._2, cardinality)))

      val drm = drmWrap(rdd.map(row => (row._1, row._2.asInstanceOf[Vector])))

      // All element-wise operations below fail for a DRM whose sparse vectors do not share a consistent cardinality.
      val drm2 = drm + drm.t
      val drm3 = drm - drm.t
      val drm4 = drm * drm.t
      val drm5 = drm / drm.t
      

      Notice that in the last map, svec accepts one extra cardinality parameter, so the cardinality of the created sparse vectors is consistent.
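
      As a hedged aside, one way to confirm that the vectors ended up consistent is to check the distinct sizes in the rdd above before calling drmWrap (Vector.size() returns the cardinality); the check below is illustrative, not part of the proposal:

        val sizes = rdd.map(_._2.size).distinct().collect()
        require(sizes.length == 1,
          s"Expected one cardinality across all row vectors but found: ${sizes.mkString(", ")}")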


          People

            Assignee: Edmond Luo (resec)
            Reporter: Edmond Luo (resec)
            Votes: 0
            Watchers: 4
